I'm working on a mobile website which is growing in popularity, and that growth is swelling some key database tables - we're starting to see performance issues when accessing those tables. Not being database experts (nor having the money to hire any at this stage), we're struggling to understand what is causing the performance problems. Our tables are not that big, so SQL Server should be able to handle them fine, and we've done everything we know to do in terms of optimising our queries. So here's the (pseudo) table structure:
[user] (approx. 40,000 rows, 37 cols):
id INT (pk)
content_group_id INT (fk)
[username] VARCHAR(20)
...
[content_group] (approx. 200,000 rows, 5 cols):
id INT (pk)
title VARCHAR(20)
...
[content] (approx. 1,000,000 rows, 12 cols):
id INT (pk)
content_group_id INT (fk)
content_type_id INT (fk)
content_sub_type_id INT (fk)
...
[content_type] (2 rows, 3 cols)
id INT (pk)
...
[content_sub_type] (8 rows, 3 cols)
id INT (pk)
content_type_id INT (fk)
...
We're expecting those row counts to grow considerably (in particular the user, content_group, and content tables). Yes the user table has quite a few columns - and we've identified some which can be moved into other tables. There are also a bunch of indexes we've applied to the affected tables which have helped.
The big performance problems are the stored procedures we're using to search for users (which include joins to the content table on the content_group_id field). We have tried to modify the WHERE and AND clauses using various approaches and we think we've got them as good as we can, but it's still too slow.
One other thing we tried which hasn't helped was to put an indexed view over the user and content tables. There was no noticeable performance gain when we did this so we've abandoned that idea due to the extra level of complexity inherent in having a view layer.
So, what are our options? We can think of a few but all come with pros and cons:
Denormalisation of the Table Structure
Add multiple direct foreign key constraints between the user and content tables - so there would be a different foreign key to the content table for each content sub type.
Pros:
Joining the content table will be more optimal by using its primary key.
Cons:
There will be a lot of changes to our existing stored procedures and website code.
Maintaining up to 8 additional foreign keys (more realistically we'll only use 2 of these) will not be anywhere near as easy as the current single key.
More Denormalisation of the Table Structure
Just duplicate the fields we need from the content table into the user table directly.
Pros:
No more joins to the content table - which significantly reduces the work SQL has to do.
Cons:
Same as above: extra fields to maintain in the user table, changes to SQL and website code.
Create a Mid-Tier Indexing Layer
Using something like Lucene.NET, we'd put an indexing layer above the database. In theory this would improve the performance of all searches and at the same time decrease the load on the database server.
Pros:
This is a good long-term solution. Lucene exists to improve search engine performance.
Cons:
There will be a much larger development cost in the short term - and we need to solve this problem ASAP.
So those are the things we've come up with, and at this stage we're thinking the second option is the best. I'm aware that denormalising has its issues; however, sometimes it's best to sacrifice architectural purity in order to get performance gains, so we're prepared to pay that cost.
Are there any other approaches which might work for us? Are there any additional pros and/or cons with the approaches I've outlined above which may influence our decisions?
"non clustered index seek from the content table using the content_sub_type_id. This is followed by a Hash Match on the content_group_id against the content table"
This description would indicate that your expensive query filters the content table based on fields from content_type:
select ...
from content c
join content_type ct on c.content_type_id = ct.id
where ct.<field> = <value>;
This table design, and the resulting problem you're now seeing, is actually quite common. The problems arise mainly from the very low selectivity of the lookup tables (content_type has only 2 rows, so the selectivity of content_type_id in content is probably around 50% - huge). There are several solutions you can try:
1) Reorganize the content table with a clustered index that has content_type_id as the leading key. This would allow the join to do range scans and also avoid the key/bookmark lookup for projection completeness. As a clustered index change it has implications for other queries, so it has to be carefully tested. The primary key on content would obviously have to be enforced with a non-clustered constraint.
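A minimal sketch of what that change could look like (the constraint and index names here are assumptions, and any foreign keys referencing content.id would have to be dropped and re-created around this):

-- sketch only: PK_content and cdx_content_type are hypothetical names
alter table dbo.content drop constraint PK_content;
alter table dbo.content add constraint PK_content primary key nonclustered (id);
create clustered index cdx_content_type on dbo.content (content_type_id, id);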
2) Pre-read the content_type_id value and then formulate the query without the join between content and content_type:
select ...
from content c
where c.content_type_id = @contentTypeId;
This works only if the selectivity of content_type_id is high (many distinct values with few rows each), which I doubt is your case (you probably have very few content types, each with many entries).
3) Denormalize content_type into content. You mention denormalization, but your proposal of denormalizing content into the user table makes little sense to me. Drop the content_type table, pull the content_type fields into the content table itself, and live with all the usual denormalization problems.
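For illustration, the denormalization might look roughly like this; content_type_title is a hypothetical column standing in for whatever fields content_type actually carries:

-- content_type_title is a placeholder name for the real content_type column(s)
alter table dbo.content add content_type_title varchar(20) null;

update c
set c.content_type_title = ct.title  -- assumes content_type has a title column
from dbo.content c
join dbo.content_type ct on ct.id = c.content_type_id;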
4) Pre-join in a materialized view. You say you already tried that, but I doubt that you tried the right materialized view. You also need to understand that only Enterprise Edition uses the materialized view index automatically, all other editions require the NOEXPAND hint:
create view vwContentType
with schemabinding
as
select c.content_type_id, c.id as content_id
from dbo.content c
join dbo.content_type ct on c.content_type_id = ct.id;
go
create unique clustered index cdxContentType on vwContentType (content_type_id, content_id);
select ...
from content c
join vwContentType ct with (noexpand)
on ct.content_id = c.id
where ct.content_type_id = @contentTypeId;
Solutions 2), 3) and 4) are mostly academic. Given the very low selectivity of content_type_id, your only solution that has a standing chance is to make it the leading key in the clustered index of content. I did not expand the analysis to content_sub_type, but with only 8 rows I'm willing to bet it has the very same problem, which would require pushing it into the clustered index as well (perhaps as the second key).
Related
I'm working on synchronizing clients with data for eventual consistency. The server will publish a list of database ids and rowversion/timestamps; the client will then request data for any rows whose version number doesn't match. The primary reason for inconsistent data is networking issues between broker nodes, split brain, etc.
When I read data from my tables, I request data based on a predicate that is not the primary key.
I iterate available regions to read data per region. This is my select:
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE RegionId = 1
Since this leads to an index scan per query, I'm wondering whether I should create a non-clustered index on my RegionId column and include the selected columns in that index:
CREATE NONCLUSTERED INDEX [ID_TableX_RegionId_Sync]
ON [dbo].[TableX] ([RegionId])
INCLUDE ([DatabaseId],[VersionTimestamp],[OperationId])
VersionTimestamp is a rowversion/timestamp column and will of course change whenever a row is updated, so I'm wondering if it is a poor design choice to include this column in an index, since the index will need to be updated at every insert/update/delete?
Since the current approach results in n index scans, rather than turning them into n index seeks it might be better to read all the data once and then group by RegionId, filling in empty lists of rows where a RegionId doesn't have any data.
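A single-pass read would be something like this (sketch, same columns as above):

-- one scan instead of a query per region; the client groups the rows by RegionId afterwards
SELECT RegionId, DatabaseId, VersionTimestamp, OperationId
FROM TableX
ORDER BY RegionId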
The real-life scenario is a bit more complicated, as there are table relationships that will also have to be queried. I have not yet looked at including one-to-many relationships in my version queries.
This is primarily about better understanding the impact of covering indexes and figuring out how to use them better. Since I am going to read all the data from the table in any case, it is probably cheaper to load it all at once; however, the per-region query above makes my code a lot cleaner for this simple no-relationship example.
Edit:
Alternative 2
Another option that came to mind, is creating a covering index on RegionId, and include my primary key (DatabaseId).
SELECT DatabaseId
FROM TableX WHERE RegionId=1
And then a new query where I select the needed columns WHERE DatabaseId IN(list, of, databaseId)
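Roughly (sketch; the index name and the id list are just placeholders):

CREATE NONCLUSTERED INDEX [ID_TableX_RegionId_Ids]
ON [dbo].[TableX] ([RegionId])
INCLUDE ([DatabaseId])

-- first round trip: ids only
SELECT DatabaseId
FROM TableX WHERE RegionId=1

-- second round trip: full rows for the ids the client actually needs
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE DatabaseId IN (1, 2, 3) -- placeholder id list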
For the current scenario, there are only thousands of rows in the table at most, not millions. Network traffic for the two (x n) queries would most likely outweigh the benefits of using the indexes, and this may be premature optimization.
We are moving an existing company desktop application to the cloud. I've been doing a lot of the database work and I have been optimizing them for appropriate indexes to maintain responsiveness as needed.
I was trying to optimize a couple of tables and couldn't get the indexes to behave on all the calls I wanted them to, so I tried a non-unique clustered key on a temp table to see if it would give me better numbers, since 'they would have disk locality, so it should be able to find them with sequential reads rather than repeated random reads'.
I have 2 tables of concern that will definitely account for the majority of traffic, but the problem is the same for both. We expect millions to tens of millions of records in our user settings table: I have confirmed that our legacy software will be syncing ~1300-1500 configuration options per user into the database, so expect a table size of at least ~40-50 million rows.
My initial design of the table was this
CREATE TABLE dbo.Settings
(
    SettingID BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerID INT NOT NULL,
    SettingTypeID INT NOT NULL
    -- ... other columns
)
CREATE NONCLUSTERED INDEX [INDEX_NAME] ON dbo.Settings(CustomerID);
I think a better optimization is
CREATE TABLE dbo.Settings
(
    CustomerID INT NOT NULL,
    SettingTypeID INT NOT NULL
    -- ... other columns
)
CREATE CLUSTERED INDEX [INDEX_NAME] ON dbo.Settings(CustomerID);
All product queries will be of the following form, possibly with some additional WHERE condition such as the specific settings I want for a given page.
SELECT * FROM dbo.Settings WHERE CustomerID=@CustomerID ...
From profiling, the selects seem to be 5-50x faster, averaging about 25-30x faster, since the engine can do a range scan rather than repeated lookups through the nonclustered index.
For some reason inserts measure anywhere from the same to 50% faster in some of my tests (my guess is that the original design has to maintain the nonclustered index on top of the write to the actual table).
Brought it up to our product lead and the current consensus seems to be 'we will throw more hardware at it if needed', since we'd have to spend about half a day rewriting some code to make it work (I'm pretty positive the new table wouldn't work with Entity Framework - or can you access the hidden uniquifier column?). But for my own knowledge, are there any gotchas I'm unaware of? It seems like for customer records where you'll often be fetching multiple items at once (i.e. a user's settings) it's best to have an index like this which approximates NoSQL clustering, so you can guarantee disk locality. I'm just not familiar enough with insert performance to see if there would be unexpected issues with tree rebuilding.
I’m working on an HR system and I need to keep a tracking record of all the views on the profile of a user, because each recruiter will have limited views on candidate profiles. My main concern is scalability of my approach, which is the following:
I currently created a table with 2 columns: the id of the candidate who was viewed and the id of the recruiter who viewed the candidate. Each view only counts once, so if you see the same candidate again no record will be inserted.
Based on the number of recruiters and candidates in the database I can safely say that my table will grow very quickly, and to make things worse I have to query it on every request, because I have to show in the UI the number of candidates that the recruiter has viewed. Which would be the best approach, considering scalability?
I'll explain the case a little bit more:
We have Companies and every Company has many Recruiters.
ViewsAssigner_Identifier Table
Id: int PK
Company_Id: int FK NON-CLUSTERED
Views_Assigned: int NON-CLUSTERED
Date: date NON-CLUSTERED
CandidateViewCounts Table
Id: int PK
Recruiter_id: int FK NON-CLUSTERED ?
Candidate_id: int FK NON-CLUSTERED ?
ViewsAssigner_Identifier_Id: int FK NON-CLUSTERED ?
DateViewed: date NON-CLUSTERED
I will query a Select of all [Candidate_id] by [ViewsAssigner_Identifier_id]
We want to search by Company, not by Recruiter, because all the Recruiters in the same company share the same [Views_Assigned] for the Company. In other words the first Recruiter who views the Candidate is stored in the "CandidateViewCounts" table and subsequent Recruiters who view the same candidate are not stored.
Result:
I need to retrieve a list of all the [Candidate_Id] by [ViewsAssigner_Identifier_id] and then I can SUM all these Candidate Ids.
Query Example:
SELECT [Candidate_Id] FROM [dbo].[CandidateViewCounts] WHERE [ViewsAssigner_Identifier_id] = 1
Any recommendations?
If you think that each recruiter might view each candidate once, you're talking about a max of 60,000 * 2,000,000 rows. That's a large number, but they aren't very wide rows; as ErikE explained you will be able to get many rows on each page, so the total I/O even for a table scan will not be quite as bad as it sounds.
That said, for maintenance reasons, as long as you don't search by CandidateID, you may want to partition this table on RecruiterID. For example, your partition scheme could have one partition for RecruiterID between 1 and 2000, one partition for 2001 -> 4000, etc. This way you max out the number of rows per partition and can plan file space accordingly (you can put each partition on its own filegroup, separating I/O).
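For example, a sketch of such a partition function and scheme (the boundary values and single-filegroup mapping are placeholders):

CREATE PARTITION FUNCTION pfRecruiterId (int)
AS RANGE LEFT FOR VALUES (2000, 4000, 6000); -- extend the boundary list as recruiter ids grow

CREATE PARTITION SCHEME psRecruiterId
AS PARTITION pfRecruiterId ALL TO ([PRIMARY]); -- or map each partition to its own filegroup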
Another point is this: if you are looking to run queries such as "how many views on this candidate (and we don't care which recruiters)?" or "how many candidates has this recruiter viewed (and we don't care which candidates)?" then you may consider indexed views. E.g.
CREATE VIEW dbo.RecruiterViewCounts
WITH SCHEMABINDING
AS
SELECT RecruiterID, COUNT_BIG(*) AS ViewCount
FROM dbo.tablename
GROUP BY RecruiterID;
GO
CREATE UNIQUE CLUSTERED INDEX pk_rvc ON dbo.RecruiterViewCounts(RecruiterID);
GO
CREATE VIEW dbo.CandidateViewCounts
WITH SCHEMABINDING
AS
SELECT CandidateID, COUNT_BIG(*) AS ViewCount
FROM dbo.tablename
GROUP BY CandidateID;
GO
CREATE UNIQUE CLUSTERED INDEX pk_cvc ON dbo.CandidateViewCounts(CandidateID);
GO
Now, these clustered indexes are expensive to maintain, so you'll want to test your write workload against them. But they should make those two queries extremely, extremely fast without having to seek into your large table and potentially read multiple pages for a very busy recruiter or a very popular candidate.
If your table is clustered on the RecruiterID you will have a very fast seek and in my opinion no performance issue at all.
In such a narrow table as you've described, finding out the profiles viewed for any one recruiter should require a single read 99+% of the time. (Assume fillfactor = 80 with minimal page splits; row width with two int columns is 8 bytes of data plus overhead, call it 20 bytes per row; with roughly 8,040 usable bytes per page that's over 300 rows per page; at an average of, say, 2.5 rows per recruiter that's ballpark 128 recruiters per data page.) The total number of rows in the table is irrelevant because it can seek into the clustered index. Yes, it has to traverse the tree, but it is still going to be very fast. There is no better way so long as the views have to be counted once per candidate; if it were simply total views, you could keep a count instead.
I don't think you have much to worry about. If you are concerned that the system could grow to tens of thousands of requests per second and produce some kind of limiting hotspot of activity: as long as the recruiters visiting at any one point in time do not coincidentally have sequential IDs assigned to them, you will be okay.
The big principle here is that you want to avoid anything that would have to scan the table top to bottom. You can avoid that as long as you always search by RecruiterID or RecruiterID, CandidateID. The moment you want to search by CandidateID alone, you will be in trouble without an additional index. Adding a nonclustered index on CandidateID will double the space your table takes (half for the clustered, half for the nonclustered) but that is no big deal. Then searching by CandidateID will be just as fast, because the nonclustered index will properly cover the query and no bookmark lookup will be required.
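That index is a one-liner (using the same dbo.tablename placeholder as above):

CREATE NONCLUSTERED INDEX IX_tablename_CandidateID ON dbo.tablename (CandidateID);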
Update
This is a response to the substantially new information you provided in the update to your question.
First, your CandidateViewCounts table is named incorrectly. It's something more like CandidateFirstViewedByRecruiterAtCompany. It can only indirectly answer the question you have, which is about the Company, not the Recruiters, so in my opinion the scenario you're describing really calls for a CompanyCandidateViewed table:
CompanyID int FK
CandidateID int FK
PRIMARY KEY CLUSTERED (CompanyID, CandidateID)
Store the CompanyID of the recruiter who viewed the candidate, and the CandidateID. Simple! Now my original answer still works for you, simply swap RecruiterID with CompanyID.
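A minimal sketch of that table; the Company and Candidate parent table/column names in the foreign keys are assumptions:

CREATE TABLE dbo.CompanyCandidateViewed
(
    CompanyID int NOT NULL REFERENCES dbo.Company (Id),     -- assumed parent table/column
    CandidateID int NOT NULL REFERENCES dbo.Candidate (Id), -- assumed parent table/column
    PRIMARY KEY CLUSTERED (CompanyID, CandidateID)
);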
If you really do want to track which recruiters viewed which candidates, then do so in a RecruiterCandidateViewed table (and store all recruiter->candidate views). That can be queried later or in a data warehouse. But your real-time OLTP needs will be met by the table described above.
Also, I would like to mention that you may be putting identity columns in tables that don't need them. You should avoid identity columns unless the column is going to be used as an FK in another table (and not always even then, since in proper data modeling you must sometimes use composite keys in FKs to prevent possible denormalization). For example, your ViewsAssigner_Identifier table seems to me to need some help (of course I don't have all the information here and could be off base). If the Company and the Date are what's most important about that table, make them together the clustered PK and get rid of the identity column if at all possible.
This could turn out to be the dumbest question ever.
I want to track groups and group members via SQL.
Let's say I have 3 groups and 6 people.
I could have a table such as:
Then if I wanted to find which PersonIDs are in GroupID 1, I would just do
select * from Table where GroupID=1
(Everyone knows that)
My problem is that I have millions of rows being added to this table, and I would like some sort of presorting by GroupID to make lookups as fast as possible.
I'm thinking of a scenario where it would have nested tables, where each sub table would contain a groupID's members. (Illustrated below)
This way, when I wanted to select a group's members, the structure in SQL would already be nested, and the lookup would not be as expensive as trawling through all the rows.
Does such a structure exist, in essence, a table that would pivot around the groupID ? Is indexing the table about groupID the best/only option?
Perhaps you see it otherwise at the moment, but what you're asking for is nothing other than an index on GroupID. But there are many more shades of gray: a lot depends on how you plan to use the table (the actual queries you're going to run) and the cardinality of the expected data.
Should the table be clustered by (PersonID) with a non clustered index on (GroupId)?
Should it be a clustered index on (GroupId, PersonID) with a non clustered index on (PersonId)?
Or should it be clustered by (PersonId, GroupId) with a non clustered index on (GroupId, PersonId)?
...
All are valid choices, depending on your requirements, and the choice you make is pretty much going to make or break your application.
Approaching this problem from the point of view of what EF or another ORM layer gives you will likely result in a bad database design. Ultimately your whole app, as fancy and carefully coded as it is, is nothing but a thin shell around the database. Consider approaching this from a sound data-modeling point of view: create a good table schema design, and then write your code on top of it, not the other way around. I understand this goes against everything the preachers on the street recommend today, but I've seen too many applications designed in Visual Studio's various data context editor(s) fail in deployment...
If the inserts will typically be incremental (in other words, when you add a row you will typically add a groupid + personid that are greater than the last row) you can create a clustered index on groupid + personid and that will make SQL physically store the rows in that order and it makes a lookup on that key very fast.
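For example, assuming a plain two-column membership table (the table and index names here are hypothetical), that clustered index would look like this:

-- sketch: cluster by group, then person, so one group's members sit together on disk
CREATE TABLE dbo.GroupMembers
(
    GroupID INT NOT NULL,
    PersonID INT NOT NULL
);

CREATE UNIQUE CLUSTERED INDEX IX_GroupMembers_Group_Person
ON dbo.GroupMembers (GroupID, PersonID);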
The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect statistics of usage (the queries actually run), look at the engine's own statistics like sys.dm_db_index_usage_stats, and then make an informed decision: the partition scheme that best balances data size and gives the best affinity for the most often run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index (the 'table' itself is just one of the indexes - the clustered index or the heap), not per table, so the question is not what to partition on, but which indexes to partition or not and what partitioning function to use. Your clustered indexes on the two tables are going to be the most likely candidates obviously (there's not much sense in partitioning just a non-clustered index and not the clustered one), so unless you're considering a redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I'd venture a guess I'd say that for any data that accumulates over time (like 'cases' with a 'year') the most natural partition is the sliding window.
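A sliding-window setup might start roughly like this (the year boundaries and the single-filegroup mapping are placeholders):

CREATE PARTITION FUNCTION pfCaseYear (int)
AS RANGE RIGHT FOR VALUES (2009, 2010, 2011); -- add a new boundary (and merge old ones) each year

CREATE PARTITION SCHEME psCaseYear
AS PARTITION pfCaseYear ALL TO ([PRIMARY]); -- placeholder filegroup mapping

-- the clustered index on Case would then be created ON psCaseYear(Year)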
If you have no other choice you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10 and place each record in the corresponding table (depending on your UniqueIdentifier values you might need to start allocating ids manually).
When performing a query, you will need to run the same query on all tables and use UNION to merge the result sets into a single query result.
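For example (a sketch assuming the Case00..Case09 tables above share the same columns; the Year filter is just a placeholder predicate):

SELECT * FROM Case00 WHERE Year = 2011
UNION
SELECT * FROM Case01 WHERE Year = 2011
-- ... repeat for Case02 through Case08 ...
UNION
SELECT * FROM Case09 WHERE Year = 2011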
It's not as good as partitioning the tables based on some logical separation which corresponds to the expected queries, but it's better than hitting the size limit of a table.
Another possible thing to look at (before partitioning) is your model.
Are you in a normalized database? Are there further steps which could improve performance by different choices in the normalization/de-/partial-normalization? Are there options to transform the data into a Kimball-style dimensional star model which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.