Scalable way to keep track of user activity - sql-server

I’m working on an HR system and I need to keep a tracking record of all the views on the profile of a user, because each recruiter will have limited views on candidate profiles. My main concern is scalability of my approach, which is the following:
I have created a table with 2 columns: the id of the candidate who was viewed and the id of the recruiter who viewed the candidate. Each view only counts once, so if a recruiter sees the same candidate again no record is inserted.
Based on the number of recruiters and candidates in the database, I can safely say that my table will grow very quickly. To make things worse, I have to query the table on every request, because the UI has to show the number of candidates that the recruiter has viewed. Which would be the best approach considering scalability?
I'll explain the case a little bit more:
We have Companies and every Company has many Recruiters.
ViewsAssigner_Identifier Table
Id: int PK
Company_Id: int FK NON-CLUSTERED
Views_Assigned: int NON-CLUSTERED
Date: date NON-CLUSTERED
CandidateViewCounts Table
Id: int PK
Recruiter_id: int FK NON-CLUSTERED ?
Candidate_id: int FK NON-CLUSTERED ?
ViewsAssigner_Identifier_Id: int FK NON-CLUSTERED ?
DateViewed: date NON-CLUSTERED
I will query a Select of all [Candidate_id] by [ViewsAssigner_Identifier_id]
We want to search by Company, not by Recruiter, because all the Recruiters in the same company share the [Views_Assigned] allotted to the Company. In other words, the first Recruiter who views a Candidate is stored in the "CandidateViewCounts" table, and subsequent Recruiters who view the same Candidate are not stored.
Result:
I need to retrieve a list of all the [Candidate_Id] by [ViewsAssigner_Identifier_id] and then I can count these Candidate Ids.
Query Example:
SELECT [Candidate_Id] FROM [dbo].[CandidateViewCounts] WHERE [ViewsAssigner_Identifier_id] = 1
Any recommendations?

If you think that each recruiter might view each candidate once, you're talking about a max of 60,000 * 2,000,000 rows. That's a large number, but they aren't very wide rows; as ErikE explained you will be able to get many rows on each page, so the total I/O even for a table scan will not be quite as bad as it sounds.
That said, for maintenance reasons, as long as you don't search by CandidateID, you may want to partition this table on RecruiterID. For example, your partition scheme could have one partition for RecruiterID between 1 and 2000, one partition for 2001 -> 4000, etc. This way you max out the number of rows per partition and can plan file space accordingly (you can put each partition on its own filegroup, separating I/O).
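A minimal sketch of that partitioning layout, under several assumptions: the table name dbo.CandidateViews, the names pfRecruiterRange and psRecruiterRange, and the three boundary values are all invented for illustration, and ALL TO ([PRIMARY]) stands in for the separate filegroups you would plan in practice. Table partitioning has also historically been limited to certain editions, so check what yours supports.

```sql
-- RANGE LEFT: each boundary value belongs to the partition on its left,
-- so partition 1 holds RecruiterID <= 2000, partition 2 holds 2001-4000,
-- partition 3 holds 4001-6000 and partition 4 holds everything above.
CREATE PARTITION FUNCTION pfRecruiterRange (int)
AS RANGE LEFT FOR VALUES (2000, 4000, 6000);
GO
-- Map every partition to PRIMARY here; list one filegroup per partition
-- instead if you want to separate I/O.
CREATE PARTITION SCHEME psRecruiterRange
AS PARTITION pfRecruiterRange ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.CandidateViews
(
    RecruiterID int NOT NULL,
    CandidateID int NOT NULL,
    -- The partitioning column must be part of the unique clustered key.
    CONSTRAINT PK_CandidateViews
        PRIMARY KEY CLUSTERED (RecruiterID, CandidateID)
) ON psRecruiterRange (RecruiterID);
GO
```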
Another point is this: if you are looking to run queries such as "how many views on this candidate (and we don't care which recruiters)?" or "how many candidates has this recruiter viewed (and we don't care which candidates)?" then you may consider indexed views. E.g.
CREATE VIEW dbo.RecruiterViewCounts
WITH SCHEMABINDING
AS
SELECT RecruiterID, ViewCount = COUNT_BIG(*)
FROM dbo.tablename
GROUP BY RecruiterID;
GO
CREATE UNIQUE CLUSTERED INDEX pk_rvc ON dbo.RecruiterViewCounts(RecruiterID);
GO
CREATE VIEW dbo.CandidateViewCounts
WITH SCHEMABINDING
AS
SELECT CandidateID, ViewCount = COUNT_BIG(*)
FROM dbo.tablename
GROUP BY CandidateID;
GO
CREATE UNIQUE CLUSTERED INDEX pk_cvc ON dbo.CandidateViewCounts(CandidateID);
GO
Now, these clustered indexes are expensive to maintain, so you'll want to test your write workload against them. But they should make those two queries extremely, extremely fast without having to seek into your large table and potentially read multiple pages for a very busy recruiter or a very popular candidate.

If your table is clustered on the RecruiterID you will have a very fast seek and in my opinion no performance issue at all.
In such a narrow table as you've described, finding the profiles viewed by any one recruiter should require a single read 99+% of the time. (Assume fillfactor = 80 with minimal page splits; row width for two int columns = 8 bytes of data plus overhead, call that 20 bytes; 8040 or so bytes per page; say each recruiter gets up to 4 views, averaging 2.5 rows per recruiter = ballpark 128 recruiters per data page.) The total number of rows in the table is irrelevant because the query can seek into the clustered index. Yes, it has to traverse the tree, but it is still going to be very fast. There is no better way so long as the views have to be counted once per candidate. If it were simply total views, you could keep a count instead.
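The ballpark can be reproduced with a few lines of arithmetic. This is only a sketch of the estimate as stated: every input is one of the rough figures from the paragraph above, not a measurement, and the variable names are made up for illustration.

```sql
-- Inputs are the rough figures from the estimate, not measured values.
DECLARE @BytesPerPage        int          = 8040,
        @FillFactor          decimal(3,2) = 0.80,
        @BytesPerRow         int          = 20,
        @AvgRowsPerRecruiter decimal(3,1) = 2.5;

SELECT
    FLOOR(@BytesPerPage * @FillFactor / @BytesPerRow) AS RowsPerPage,
    FLOOR(@BytesPerPage * @FillFactor / @BytesPerRow / @AvgRowsPerRecruiter) AS RecruitersPerPage;
-- 8040 * 0.80 / 20 = 321.6, so about 321 rows per page;
-- 321.6 / 2.5 = 128.64, so about 128 recruiters per page.
```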
I don't think you have much to worry about. If you are concerned that the system could grow to tens of thousands of requests per second and you'll get some kind of limiting hotspot of activity, as long as the recruiters visiting at any one point in time do not coincidentally have sequential IDs assigned to them, you will be okay.
The big principle here is that you want to avoid anything that would have to scan the table top to bottom. You can avoid that as long as you always search by RecruiterID or RecruiterID, CandidateID. The moment you want to search by CandidateID alone, you will be in trouble without an additional index. Adding a nonclustered index on CandidateID will double the space your table takes (half for the clustered, half for the nonclustered) but that is no big deal. Then searching by CandidateID will be just as fast, because the nonclustered index will properly cover the query and no bookmark lookup will be required.
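As a sketch of that layout, assuming a hypothetical table name dbo.RecruiterCandidateView and made-up index and constraint names:

```sql
CREATE TABLE dbo.RecruiterCandidateView
(
    RecruiterID int NOT NULL,
    CandidateID int NOT NULL,
    -- The composite clustered key also enforces "each view counts once".
    CONSTRAINT PK_RecruiterCandidateView
        PRIMARY KEY CLUSTERED (RecruiterID, CandidateID)
);
GO
-- Supports searches by CandidateID alone. The clustered key columns are
-- carried in every nonclustered index row, so RecruiterID is available
-- here without a bookmark lookup.
CREATE NONCLUSTERED INDEX IX_RecruiterCandidateView_CandidateID
    ON dbo.RecruiterCandidateView (CandidateID);
GO
-- Seeks the clustered index.
SELECT COUNT(*) AS CandidatesViewed
FROM dbo.RecruiterCandidateView
WHERE RecruiterID = 42;

-- Seeks the nonclustered index.
SELECT RecruiterID
FROM dbo.RecruiterCandidateView
WHERE CandidateID = 1001;
```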
Update
This is a response to the substantially new information you provided in the update to your question.
First, your CandidateViewCounts table is named incorrectly. It's something more like CandidateFirstViewedByRecruiterAtCompany. It can only indirectly answer the question you have, which is about the Company, not the Recruiters, so in my opinion the scenario you're describing really calls for a CompanyCandidateViewed table:
CompanyID int FK
CandidateID int FK
PRIMARY KEY CLUSTERED (CompanyID, CandidateID)
Store the CompanyID of the recruiter who viewed the candidate, and the CandidateID. Simple! Now my original answer still works for you, simply swap RecruiterID with CompanyID.
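Spelled out as DDL with the two operations the UI needs, this might look like the following sketch. The constraint name and the sample values are assumptions, and the foreign keys are left as comments because the referenced tables are not shown in the question.

```sql
CREATE TABLE dbo.CompanyCandidateViewed
(
    CompanyID   int NOT NULL,  -- FK to the company table (not shown)
    CandidateID int NOT NULL,  -- FK to the candidate table (not shown)
    CONSTRAINT PK_CompanyCandidateViewed
        PRIMARY KEY CLUSTERED (CompanyID, CandidateID)
);
GO
DECLARE @CompanyID int = 1, @CandidateID int = 1001;

-- Record the view only the first time this company sees this candidate.
INSERT INTO dbo.CompanyCandidateViewed (CompanyID, CandidateID)
SELECT @CompanyID, @CandidateID
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.CompanyCandidateViewed
                  WHERE CompanyID = @CompanyID
                    AND CandidateID = @CandidateID);

-- Number of distinct candidates the company has viewed so far.
SELECT COUNT(*) AS CandidatesViewed
FROM dbo.CompanyCandidateViewed
WHERE CompanyID = @CompanyID;
```

Under concurrent requests two sessions can both pass the NOT EXISTS check; the primary key then rejects the second insert with a duplicate-key error, which the caller can safely ignore.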
If you really do want to track which recruiters viewed which candidates, then do so in a RecruiterCandidateViewed table (and store all recruiter->candidate views). That can be queried later or in a data warehouse. But your real-time OLTP needs will be met by the table described above.
Also, I would like to mention that it is possible you are putting identity columns in tables that don't need them. You should avoid identity columns unless the column is going to be used as an FK in another table (and not always even then, as sometimes in proper data modeling in order to prevent possible denormalization you must use composite keys in FKs). For example, your ViewsAssigner_Identifier table seems to me to need some help (of course I don't have all the information here and could be off base). If the Company and the Date are what's most important about that table, make them together the clustered PK and get rid of the identity column if at all possible.

Related

Covering index including rowversion? Good or bad

I'm working on synchronizing clients with data for eventual consistency. The server will publish a list of database ids and rowversion/timestamp. Client will then request data with incorrect version number. The primary reason for inconsistent data is networking issues between broker nodes, split brain, etc.
When I read data from my tables, I request data based on a predicate that is not the primary key.
I iterate available regions to read data per region. This is my select:
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE RegionId = 1
Since this leads to an index scan per query, I'm wondering whether I should create a non-clustered index on my RegionId column and include the selected columns in that index:
CREATE NONCLUSTERED INDEX [ID_TableX_RegionId_Sync]
ON [dbo].[TableX] ([RegionId])
INCLUDE ([DatabaseId],[VersionTimestamp],[OperationId])
VersionTimestamp is rowversion/timestamp column, and will of course change whenever a row is updated, so I'm wondering if it is a poor design choice to include this column in an index since it will need to be updated at every insert/update/delete?
Since this will result in n index scans, rather than n index seeks, it might be better to read all the data once, and then group by regionId and fill in empty lists of rows where a regionId doesn't have any data.
The real life scenario is a bit more complicated, as there are table relationships that will also have to be queried. I have not yet looked at including one to many relationships in my version queries.
This is primarily about better understanding the impact of covering indexes and figuring out how to better use them. Since I am going to read all the data from the table in any case, it is probably cheaper to load it all at once. However, reading it with the query above makes my code a lot cleaner, at least for this simple no-relationship example.
Edit:
Alternative 2
Another option that came to mind is creating a covering index on RegionId that includes my primary key (DatabaseId).
SELECT DatabaseId
FROM TableX WHERE RegionId=1
And then a new query where I select the needed columns WHERE DatabaseId IN(list, of, databaseId)
For the current scenario, there are at most thousands of rows in the table, not millions. Network traffic for the two (x n) queries will most likely outweigh the benefits of using indexes, so this may be premature optimization.

Table design for scheduling calls

Say you have a call center that needs to do X amount of calls in a given day. Each customer can be called for Y amount of products that they need to pay. The business would like to assign the calls to the operators at the start of the day and then be able to evaluate daily the results of those calls.
For those reasons, I'm thinking about dividing the structure into two, with one table for the customers and another for the products.
What I am having problem with is deciding the primary key to put in the master scheduling table. Since there will be a good number of records every day, part of me wants to write the data in sequence for each day:
DateOfRecords date
Sequence int
OperatorID int
CustomerID int
Rest of the columns…
With DateOfRecords and sequence being the primary key.
I know that many recommend establishing an integer as a primary key and then index on the other columns. Again, my question comes because of the amount of records to be saved daily given the fact that this will be a historic table.
Any suggestions?
You can have just one clustered key per table. In most cases - but not necessarily - this will be the primary key. There are some things to consider:
The clustered key is the physical table itself. It includes all columns automatically.
The best clustered key is one which will never get fragmented (e.g. a sequence). The worst clustered key is an ordinary Guid.
If there is a clustered key, it will serve as a look-up for other indexes. If the clustered key is bad (e.g. fragmenting), other indexes will be affected.
If your clustered index becomes fragmented, you will have to repair it from time to time. This requires a huge physical shifting of data and should be avoided if possible.
Normal indexes are rather slim. They hold, to put it simply, a sorted list together with the PK for fast access to the actual row (with all the columns).
So - to put all this together: You should have a clustered primary key index using a non-fragmenting column. Additionally you can use as many indexes as you need, covering (and including) columns you need in queries.
There can be good reasons for multi-column PKs; sometimes this can be the best choice. But the general advice is: Use a sequence as the (clustered) PK and place additional indexes to support your queries.
This will not be the absolutely fastest approach in each and any case, but a very good serves-for-all approach.
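Applied to the scheduling table from the question, that general advice could look like the sketch below. The table name dbo.CallSchedule, the identity column CallScheduleID and the index name are assumptions made for illustration; only DateOfRecords, OperatorID and CustomerID come from the question.

```sql
CREATE TABLE dbo.CallSchedule
(
    CallScheduleID int IDENTITY(1,1) NOT NULL,  -- ever-increasing, so inserts append
    DateOfRecords  date NOT NULL,
    OperatorID     int  NOT NULL,
    CustomerID     int  NOT NULL,
    -- rest of the columns...
    CONSTRAINT PK_CallSchedule PRIMARY KEY CLUSTERED (CallScheduleID)
);
GO
-- Supports the daily "which calls does each operator have" queries and
-- covers CustomerID so those queries need no lookup into the table.
CREATE NONCLUSTERED INDEX IX_CallSchedule_Date_Operator
    ON dbo.CallSchedule (DateOfRecords, OperatorID)
    INCLUDE (CustomerID);
GO
```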
And one more point: Many programming approaches are best supported, if the tables have some things in common. E.g. an ID column or something like InsertDateTime. This allows for more generic programming approaches.
UPDATE: To add to your comment...
In a comment above you state: "...and then having problems later on when there are millions of records later on". Such issues are best solved with partitioned tables, filtered indexes and/or archiving strategies...
UPDATE2: Good to think about sargability...
In your question you write "Since there will be a good number of records every day, part of me wants to write the data in sequence for each day". Most operations you perform on a column will prevent the use of indexes (read about sargability). But a CAST() of a DATETIME to a DATE lets you filter for calendar days without performance losses. Something like WHERE CAST(InsertDateTime AS DATE) = '20190909' will work lightning fast with an index on InsertDateTime.
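A small sketch of the difference, assuming a hypothetical table dbo.CallLog with an index on InsertDateTime (the table, column list and index name are invented for illustration):

```sql
CREATE TABLE dbo.CallLog
(
    CallLogID      int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    InsertDateTime datetime NOT NULL
);
CREATE NONCLUSTERED INDEX IX_CallLog_InsertDateTime
    ON dbo.CallLog (InsertDateTime);
GO
-- The CAST to date is a special case that can still use an index seek.
SELECT CallLogID
FROM dbo.CallLog
WHERE CAST(InsertDateTime AS date) = '20190909';

-- Equivalent half-open range, sargable without relying on that special case.
SELECT CallLogID
FROM dbo.CallLog
WHERE InsertDateTime >= '20190909'
  AND InsertDateTime <  '20190910';

-- Not sargable: formatting the column forces every row to be evaluated.
SELECT CallLogID
FROM dbo.CallLog
WHERE CONVERT(char(8), InsertDateTime, 112) = '20190909';
```

Comparing the execution plans of the three queries should show seeks for the first two and a scan for the last.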

Create more than one non clustered index on same column in SQL Server

What is the index creating strategy?
Is it possible to create more than one non-clustered index on the same column in SQL Server?
How about creating clustered and non-clustered on same column?
Very sorry, but indexing is very confusing to me.
Is there any way to find out the estimated query execution time in SQL Server?
The words are rather logical and you'll learn them quite quickly. :)
In layman's terms, SEEK implies seeking out precise locations for records, which is what SQL Server does when the column you're searching in is indexed and your filter (the WHERE condition) is accurate enough.
SCAN means a larger range of rows where the query execution planner estimates it's faster to fetch a whole range as opposed to individually seeking each value.
And yes, you can have multiple indexes on the same field, and sometimes it can be a very good idea. Play around with the indexes and use the query execution planner to determine what happens (shortcut in SSMS: Ctrl + M). You can even run two versions of the same query and the execution planner will easily show you how much time and how many resources each one takes, making optimization quite easy.
But to expand on these a bit, say you have an address table like so, and it has over 1 billion records:
CREATE TABLE ADDRESS
(ADDRESS_ID INT -- CLUSTERED primary key ADDRESS_PK_IDX
, PERSON_ID INT -- FOREIGN KEY, NONCLUSTERED INDEX ADDRESS_PERSON_IDX
, CITY VARCHAR(256)
, MARKED_FOR_CHECKUP BIT
-- , ...plus n^10 different other columns...
)
Now, if you want to find all the address information for person 12345, the index on PERSON_ID is perfect. Since the table has loads of other data on the same row, it would be inefficient and space-consuming to create a nonclustered index to cover all other columns as well as PERSON_ID. In this case, SQL Server will execute an index SEEK on the index in PERSON_ID, then use that to do a Key Lookup on the clustered index in ADDRESS_ID, and from there return all the data in all other columns on that same row.
However, say you want to search for all the persons in a city, but you don't need other address information. This time, the most effective way would be to create an index on CITY and use INCLUDE option to cover PERSON_ID as well. That way, a single index seek / scan would return all the information you need without the need to resort to checking the CLUSTERED index for the PERSON_ID data on the same row.
Now, let's say both of those queries are required but still rather heavy because of the 1 billion records. But there's one special query that needs to be really really fast. That query wants all the persons on addresses that have been MARKED_FOR_CHECKUP, and who must live in New York (ignore whatever checkup means, that doesn't matter). Now you might want to create a third, filtered index on MARKED_FOR_CHECKUP and CITY, with INCLUDE covering PERSON_ID, and with a filter saying CITY = 'New York' and MARKED_FOR_CHECKUP = 1. This index would be insanely fast, as it only ever covers queries that satisfy those exact conditions, and therefore has a fraction of the data to go through compared to the other indexes.
(Disclaimer here, bear in mind that the query execution planner is not stupid, it can use multiple nonclustered indexes together to produce the correct results, so the examples above may not be the best ones available as it's very hard to imagine when you would need 3 different indexes covering the same column, but I'm sure you get the idea.)
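The three indexes described above could be written roughly as follows. ADDRESS_PERSON_IDX is the name used in the table definition; ADDRESS_CITY_IDX and ADDRESS_NY_CHECKUP_IDX are names invented for this sketch, and filtered indexes require SQL Server 2008 or later.

```sql
-- 1) Seek by person; the remaining columns come from a Key Lookup
--    into the clustered index.
CREATE NONCLUSTERED INDEX ADDRESS_PERSON_IDX
    ON ADDRESS (PERSON_ID);

-- 2) Covering index: "persons in a city" is answered from the index alone.
CREATE NONCLUSTERED INDEX ADDRESS_CITY_IDX
    ON ADDRESS (CITY)
    INCLUDE (PERSON_ID);

-- 3) Filtered index: only rows matching the WHERE clause are stored,
--    so it is a small fraction of the size of the other two.
CREATE NONCLUSTERED INDEX ADDRESS_NY_CHECKUP_IDX
    ON ADDRESS (MARKED_FOR_CHECKUP, CITY)
    INCLUDE (PERSON_ID)
    WHERE CITY = 'New York' AND MARKED_FOR_CHECKUP = 1;
```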
The types of index, their columns, included columns, sort orders, filters etc. depend entirely on the situation. You will need covering indexes to satisfy several different types of queries, as well as customized indexes created specifically for single, important queries. However, each index takes up space on disk, requires extra maintenance whenever the data model changes, and adds time to defragmentation and statistics update operations, so you don't want to just slap an index on everything either.
Experiment, learn and work out which works best for your needs.
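One way to run such experiments, sketched here against the ADDRESS example above, is to switch on the I/O and timing statistics for the session and compare the figures reported for each version of a query:

```sql
SET STATISTICS IO ON;    -- reports logical and physical reads per table
SET STATISTICS TIME ON;  -- reports parse/compile and execution times

SELECT PERSON_ID
FROM ADDRESS
WHERE CITY = 'New York';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The figures appear on the Messages tab in SSMS. Running the same query before and after adding an index gives a simple comparison alongside the execution plan.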
I'm not the expert on indexing either, but here is what I know.
You can have only ONE Clustered Index per table.
You can have up to a certain limit of non clustered indexes per table. Refer to http://social.msdn.microsoft.com/Forums/en-US/63ba3877-e0bd-4417-a04b-19c3bfb02ac9/maximum-number-of-index-per-table-max-no-of-columns-in-noncluster-index-in-sql-server?forum=transactsql
Indexes just need to have different names, but it's better not to use the same column(s) in a lot of different indexes, as you will run into performance problems.
A very important point to remember is that although indexes make your selects faster, they slow down your Insert/Update/Delete operations, because the information also needs to be written to each index. The more indexes you have on a column that gets updated a lot, the more drastically the speed of the update is reduced.
You can include columns that are used in a CLUSTERED index in one or more NON-CLUSTERED indexes.
Here is some more reading material
http://www.sqlteam.com/article/sql-server-indexes-the-basics
http://www.programmerinterview.com/index.php/database-sql/what-is-an-index/
EDIT
Another point to remember is that an index takes up space just like the table. The more indexes you create the more space they use, so try not to use char/varchar (or nchar/nvarchar) in an index. It uses too much space in the index, and on huge columns gives basically no benefit. When your indexes start to become bigger than your table, it also means that you have to rethink your index strategy.

Optimal Strategy to Resolve Performance in Search Operations - SQL Server 2008

I'm working on a mobile website which is growing in popularity and this is leading to growth in some key database tables - and we're starting to see some performance issues when accessing those tables. Not being database experts (nor having the money to hire any at this stage) we're struggling to understand what is causing the performance problems. Our tables are not that big so SQL Server should be able to handle them fine and we've done everything we know to do in terms of optimising our queries. So here's the (pseudo) table structure:
[user] (approx. 40,000 rows, 37 cols):
id INT (pk)
content_group_id INT (fk)
[username] VARCHAR(20)
...
[content_group] (approx. 200,000 rows, 5 cols):
id INT (pk)
title VARCHAR(20)
...
[content] (approx. 1,000,000 rows, 12 cols):
id INT (pk)
content_group_id INT (fk)
content_type_id INT (fk)
content_sub_type_id INT (fk)
...
[content_type] (2 rows, 3 cols)
id INT (pk)
...
[content_sub_type] (8 rows, 3 cols)
id INT (pk)
content_type_id INT (fk)
...
We're expecting those row counts to grow considerably (in particular the user, content_group, and content tables). Yes the user table has quite a few columns - and we've identified some which can be moved into other tables. There are also a bunch of indexes we've applied to the affected tables which have helped.
The big performance problems are the stored procedures we're using to search for users (which include joins to the content table on the content_group_id field). We have tried to modify the WHERE and AND clauses using various different approaches and we think we have got them as good as we can but still it's too slow.
One other thing we tried which hasn't helped was to put an indexed view over the user and content tables. There was no noticeable performance gain when we did this so we've abandoned that idea due to the extra level of complexity inherent in having a view layer.
So, what are our options? We can think of a few but all come with pros and cons:
Denormalise the Table Structure
Add multiple direct foreign key constraints between the user and content tables - so there would be a different foreign key to the content table for each content sub type.
Pros:
Joining the content table will be more optimal by using its primary key.
Cons:
There will be a lot of changes to our existing stored procedures and website code.
Maintaining up to 8 additional foreign keys (more realistically we'll only use 2 of these) will not be anywhere near as easy as the current single key.
More Denormalisation of the Table Structure
Just duplicate the fields we need from the content table into the user table directly.
Pros:
No more joins to the content table - which significantly reduces the work SQL has to do.
Cons:
Same as above: extra fields to maintain in the user table, changes to SQL and website code.
Create a Mid-Tier Indexing Layer
Using something like Lucene.NET, we'd put an indexing layer above the database. This would in theory improve performance of all search and at the same time decrease the load on the server.
Pros:
This is a good long-term solution. Lucene exists to improve search engine performance.
Cons:
There will be a much larger development cost in the short term - and we need to solve this problem ASAP.
So those are the things we've come up with and at this stage we're thinking the second option is the best. I'm aware that denormalising has its issues; however, sometimes it's best to sacrifice architectural purity in order to get performance gains, so we're prepared to pay that cost.
Are there any other approaches which might work for us? Are there any additional pros and/or cons with the approaches I've outlined above which may influence our decisions?
"non clustered index seek from the content table using the content_sub_type_id. This is followed by a Hash Match on the content_group_id against the content table"
This description would indicate that your expensive query filters the content table based on fields from content_type:
select ...
from content c
join content_type ct on c.content_type_id = ct.id
where ct.<field> = <value>;
This table design, and the resulting problem you are seeing, is actually quite common. The problems arise mainly due to the very low selectivity of the lookup tables (content_type has 2 rows, therefore the selectivity of content_type_id in content is probably 50%, huge). There are several solutions you can try:
1) Organize the content table on clustered index with content_type_id as the leading key. This would allow the join to do range scans and also avoid the key/bookmark lookup for the projection completeness. As a clustered index change, it would have implications on other queries so it has to be carefully tested. The primary key on content would obviously have to be enforced with a non-clustered constraint.
2) Pre-read the content_type_id value and then formulate the query without the join between content and content_type:
select ...
from content c
where c.content_type_id = @contentTypeId;
This works only if the selectivity of content_type_id is high (many distinct values with few rows each), which I doubt is your case (you probably have very few content types, with many entries each).
3) Denormalize content_Type into content. You mention denormalization, but your proposal of denormalizing content into users makes little sense to me. Drop the content_type table, pull in the content_type fields into the content table itself, and live with all the denormalization problems.
4) Pre-join in a materialized view. You say you already tried that, but I doubt that you tried the right materialized view. You also need to understand that only Enterprise Edition uses the materialized view index automatically, all other editions require the NOEXPAND hint:
create view vwContentType
with schemabinding
as
select c.content_type_id, c.id as content_id
from dbo.content c
join dbo.content_type ct on c.content_type_id = ct.id;
go
create unique clustered index cdxContentType on vwContentType (content_type_id, content_id);
go
select ...
from content c
join vwContentType ct with (noexpand)
on ct.content_id = c.id
where ct.content_type_id = @contentTypeId;
Solutions 2), 3) and 4) are mostly academic. Given the very low selectivity of content_type_id, the only solution that stands a chance is to make it the leading key in the clustered index of content. I did not expand the analysis to content_sub_type, but with only 8 rows I'm willing to bet it has the very same problem, which would require pushing it into the clustered index as well (perhaps as the second leading key).
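A rough sketch of that clustered index change follows. The constraint name PK_content is an assumption (the real name can be looked up in sys.key_constraints), cdxContent is an invented index name, and the statements rebuild the whole table, so they belong in a maintenance window and need testing against the other queries first.

```sql
-- PK_content is an assumed name. Dropping the primary key fails while
-- foreign keys in other tables still reference it; those would have to
-- be dropped first and re-created afterwards.
ALTER TABLE dbo.content DROP CONSTRAINT PK_content;
GO
-- Lead with the low-selectivity lookup columns so the join can do range
-- scans; id is appended to keep the clustered key unique.
CREATE UNIQUE CLUSTERED INDEX cdxContent
    ON dbo.content (content_type_id, content_sub_type_id, id);
GO
-- Enforce the primary key with a nonclustered constraint instead.
ALTER TABLE dbo.content
    ADD CONSTRAINT PK_content PRIMARY KEY NONCLUSTERED (id);
GO
```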

Using a meaningless ID as my clustered index rather than my primary key

I'm working in SQL Server 2008 R2
As part of a complete schema rebuild, I am creating a table that will be used to store advertising campaign performance by zipcode by day. The table setup I'm thinking of is something like this:
CREATE TABLE [dbo].[Zip_Perf_by_Day] (
[CampaignID] int NOT NULL,
[ZipCode] int NOT NULL,
[ReportDate] date NOT NULL,
[PerformanceMetric1] int NOT NULL,
[PerformanceMetric2] int NOT NULL,
[PerformanceMetric3] int NOT NULL,
and so on... )
Now the combination of CampaignID, ZipCode, and ReportDate is a perfect natural key, they uniquely identify a single entity, and there shouldn't be 2 records for the same combination of values. Also, almost all of my queries to this table are going to be filtered on 1 or more of these 3 columns. However, when thinking about my clustered index for this table, I run into a problem. These 3 columns do not increment over time. ReportDate is OK, but CampaignID and Zipcode are going to be all over the place while inserting rows. I can't even order them ahead of time because results come in from different sources during the day, so data for CampaignID 50000 might be inserted at 10am, and CampaignID 30000 might come in at 2pm. If I use the PK as my clustered index, I'm going to run into fragmentation problems.
So I was thinking that I need an Identity ID column, let's call it PerformanceID. I can see no case where I would ever use PerformanceID in either the select list or where clause of any query. Should I use PerformanceID as my PK and clustered index, and then set up a unique constraint and non-clustered indexes on CampaignID, ZipCode, and ReportDate? Should I keep those 3 columns as my PK and just have my clustered index on PerformanceID? (<- This is the option I'm leaning towards right now) Is it OK to have a slightly fragmented table? Is there another option I haven't considered? I am looking for what would give me the best read performance, while not completely destroying write performance.
Some actual usage information. This table will get written to in batches. Feeds come in at various times during the day, they get processed, and this table gets written to. It's going to get heavily read, as by-day performance is important around here. When I fill this table, it should have about 5 million rows, and will grow at a pace of about 8,000 - 10,000 rows per day.
In my experience, you probably do want to use another INT Identity field as your clustered index key. I would also add a UNIQUE constraint to that one (it helps with execution plans).
A big part of the reason is space - if you use a 3 field key for your clustered index, you will have all 3 fields in every row of every non-clustered index on that table (as your clustered index row identifier). If you only plan to have a couple of indexes that isn't a big deal, but if you have a lot of them it can make a big difference. The more data per row, the more pages needed and the more IO you have.
Fragmentation is a very real issue that can cause major performance problems, especially as the table grows.
Having that additional cluster key will also mean writes will be faster for your inserts. All new rows will go to the end of your table, which means existing rows won't be touched or rearranged.
If you want to use those three fields as a FK in other tables, then by all means have them as your PK.
For the most part it doesn't really matter if you ever directly reference your clustered index key. As long as it is narrow, increasing, and unique you should be in good shape.
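A sketch of that arrangement for the table in the question, keeping the natural key as a nonclustered primary key and clustering on the identity column. PerformanceID is the name proposed in the question; the constraint and index names are assumptions.

```sql
CREATE TABLE dbo.Zip_Perf_by_Day (
    PerformanceID      int IDENTITY(1,1) NOT NULL,
    CampaignID         int  NOT NULL,
    ZipCode            int  NOT NULL,
    ReportDate         date NOT NULL,
    PerformanceMetric1 int  NOT NULL,
    -- remaining metric columns...
    -- The natural key stays the primary key, enforced by a nonclustered index.
    CONSTRAINT PK_Zip_Perf_by_Day
        PRIMARY KEY NONCLUSTERED (CampaignID, ZipCode, ReportDate)
);
GO
-- Narrow, increasing and unique: new rows are appended at the end.
CREATE UNIQUE CLUSTERED INDEX CIX_Zip_Perf_by_Day
    ON dbo.Zip_Perf_by_Day (PerformanceID);
GO
```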
EDIT:
As Damien points out in the comments, if you will be filtering on single fields of your PK, you will need to have an index on each one (or always use the first field in the covering index).
On the information given (ReportDate, CampaignID, ZipCode) or (ReportDate, ZipCode, CampaignID) seem like better candidates for the clustered index than a surrogate key. Defragmentation would be a potential concern if the time taken to rebuild indexes became prohibitive but given the sizes I would expect for this table (10s or 1000s rather than 1,000,000s of rows per day) that seems unlikely to be an issue.
If I understood all you have written correctly you are opting out of natural clustering due to fragmentation penalties.
For this purpose you consider meaningless IDs which will:
avoid insert penalties for clustered index when inserting out of order batches (great for write performance)
guarantee that your data is fragmented for reads that put conditions on the natural key (not so good for read performance)
JNK points out that fragmentation can be a real issue; however, you need to establish a baseline against which you will measure, and you need to establish whether reading or writing is more important to you (or how important they are in measurable terms).
There's nothing that will beat a good test case - so finally that is the best recommendation I can give.
With databases it is often relatively easy to build scripts that will create real benchmarks with real workloads and realistic data quantities.
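As one possible measurement for such a benchmark, fragmentation can be read from the sys.dm_db_index_physical_stats dynamic management function after each test load. This sketch assumes the table from the question exists in the current database:

```sql
-- One row per index (and per partition and allocation unit) on the table.
SELECT i.name AS IndexName,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(),                           -- current database
         OBJECT_ID(N'dbo.Zip_Perf_by_Day'), -- table under test
         NULL, NULL,                        -- all indexes, all partitions
         'LIMITED') AS ps                   -- cheapest scan mode
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;
```

Recording these numbers next to read and write timings for each candidate clustering key gives the baseline to compare against.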
