I'm working on synchronizing clients with data for eventual consistency. The server will publish a list of database ids and rowversion/timestamp. Client will then request data with incorrect version number. The primary reason for inconsistent data is networking issues between broker nodes, split brain, etc.
When I read data from my tables, I request data based on a predicate that is not the primary key.
I iterate available regions to read data per region. This is my select:
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE RegionId = 1
Since this leads to an index scan per query, I'm wondering if a non-clustered index on my RegionId column, and include the selected columns in that index:
CREATE NONCLUSTERED INDEX [ID_TableX_RegionId_Sync]
ON [dbo].[TableX] ([RegionId])
INCLUDE ([DatabaseId],[VersionTimestamp],[OperationId])
VersionTimestamp is rowversion/timestamp column, and will of course change whenever a row is updated, so I'm wondering if it is a poor design choice to include this column in an index since it will need to be updated at every insert/update/delete?
Since this will result in n index scans, rather than n index seeks, it might be better to read all the data once, and then group by regionId and fill in empty lists of rows where a regionId doesn't have any data.
The real life scenario is a bit more complicated, as there are table relationships that will also have to be queried. I haven not yet looked at including one to many relationships in my version queries.
This is primarily about better understanding the impact of covering indexes and figuring out how to better use them. Since I am going to read all the data from the table in any case, it is probably cheaper to load them all at once. However, reading them as from the query above, it makes my code a lot cleaner for this simple no-relationship example alone.
Edit:
Alternative 2
Another option that came to mind, is creating a covering index on RegionId, and include my primary key (DatabaseId).
SELECT DatabaseId
FROM TableX WHERE RegionId=1
And then a new query where I select the needed columns WHERE DatabaseId IN(list, of, databaseId)
For the current scenario, there are only max thousands of rows in the table, and not in the millions. Network traffic for the two (x n) queries might most likely outweigh the benefits of using indexes, and be premature optimization.
Related
What is the index creating strategy?
Is it possible to create more than one non-clustered index on the same column in SQL Server?
How about creating clustered and non-clustered on same column?
Very sorry, but indexing is very confusing to me.
Is there any way to find out the estimated query execution time in SQL Server?
The words are rather logical and you'll learn them quite quickly. :)
In layman's terms, SEEK implies seeking out precise locations for records, which is what the SQL Server does when the column you're searching in is indexed, and your filter (the WHERE condition) is accurrate enough.
SCAN means a larger range of rows where the query execution planner estimates it's faster to fetch a whole range as opposed to individually seeking each value.
And yes, you can have multiple indexes on the same field, and sometimes it can be a very good idea. Play out with the indexes and use the query execution planner to determine what happens (shortcut in SSMS: Ctrl + M). You can even run two versions of the same query and the execution planner will easily show you how much resources and time is taken by each, making optimization quite easy.
But to expand on these a bit, say you have an address table like so, and it has over 1 billion records:
CREATE TABLE ADDRESS
(ADDRESS_ID INT -- CLUSTERED primary key ADRESS_PK_IDX
, PERSON_ID INT -- FOREIGN KEY, NONCLUSTERED INDEX ADDRESS_PERSON_IDX
, CITY VARCHAR(256)
, MARKED_FOR_CHECKUP BIT
, **+n^10 different other columns...**)
Now, if you want to find all the address information for person 12345, the index on PERSON_ID is perfect. Since the table has loads of other data on the same row, it would be inefficient and space-consuming to create a nonclustered index to cover all other columns as well as PERSON_ID. In this case, SQL Server will execute an index SEEK on the index in PERSON_ID, then use that to do a Key Lookup on the clustered index in ADDRESS_ID, and from there return all the data in all other columns on that same row.
However, say you want to search for all the persons in a city, but you don't need other address information. This time, the most effective way would be to create an index on CITY and use INCLUDE option to cover PERSON_ID as well. That way, a single index seek / scan would return all the information you need without the need to resort to checking the CLUSTERED index for the PERSON_ID data on the same row.
Now, let's say both of those queries are required but still rather heavy because of the 1 billion records. But there's one special query that needs to be really really fast. That query wants all the persons on addresses that have been MARKED_FOR_CHECKUP, and who must live in New York (ignore whatever checkup means, that doesn't matter). Now you might want to create a third, filtered index on MARKED_FOR_CHECKUP and CITY, with INCLUDE covering PERSON_ID, and with a filter saying CITY = 'New York' and MARKED_FOR_CHECKUP = 1. This index would be insanely fast, as it only ever cover queries that satisfy those exact conditions, and therefore has a fraction of the data to go through compared to the other indexes.
(Disclaimer here, bear in mind that the query execution planner is not stupid, it can use multiple nonclustered indexes together to produce the correct results, so the examples above may not be the best ones available as it's very hard to imagine when you would need 3 different indexes covering the same column, but I'm sure you get the idea.)
The types of index, their columns, included columns, sorting orders, filters etc depend entirely on the situation. You will need to make covering indexes to satisfy several different types of queries, as well as customized indexes created specifically for singular, important queries. Each index takes up space on the HDD so making useless indexes is wasteful and requires extra maintenance whenever the data model changes, and wastes time in defragmentation and statistics update operations though... so you don't want to just slap an index on everything either.
Experiment, learn and work out which works best for your needs.
I'm not the expert on indexing either, but here is what I know.
You can have only ONE Clustered Index per table.
You can have up to a certain limit of non clustered indexes per table. Refer to http://social.msdn.microsoft.com/Forums/en-US/63ba3877-e0bd-4417-a04b-19c3bfb02ac9/maximum-number-of-index-per-table-max-no-of-columns-in-noncluster-index-in-sql-server?forum=transactsql
Indexes should just have different names, but its better not to use the same column(s) on a lot of different indexes as you will run into some performance problems.
A very important point to remember is that Indexes although it makes your select faster, influence your Insert/Update/Delete speed as the information needs to be added to the index, which means that the more indexes you have on a column that gets updated a lot, will drastically reduce the speed of the update.
You can include columns that is used on a CLUSTERED index in one or more NON-CLUSTERED indexes.
Here is some more reading material
http://www.sqlteam.com/article/sql-server-indexes-the-basics
http://www.programmerinterview.com/index.php/database-sql/what-is-an-index/
EDIT
Another point to remember is that an index takes up space just like the table. The more indexes you create the more space it uses, so try not to use char/varchar (or nchar/nvarchar) in an index. It uses to much space in the index, and on huge columns give basically no benefit. When your Indexes start to become bigger than your table, it also means that you have to relook your index strategy.
I’m working on an HR system and I need to keep a tracking record of all the views on the profile of a user, because each recruiter will have limited views on candidate profiles. My main concern is scalability of my approach, which is the following:
I currently created a table with 2 columns, the id of the candidate who was viewed and the id of the recruiter who viewed the candidate, each view only counts once, so if you see the same candidate again no record will be inserted.
Based on the number of recruiters and candidates in the database I can safely say that my table will grow very quick and to make things worst I have to query my table on every request, because I have to show in the UI the number of candidates that the recruiter has viewed. Which would be the best approach considering scalability?
I'll explain the case a little bit more:
We have Companies and every Company has many Recruiters.
ViewsAssigner_Identifier Table
Id: int PK
Company_Id: int FK NON-CLUSTERED
Views_Assigned: int NON-CLUSTERED
Date: date NON-CLUSTERED
CandidateViewCounts Table
Id: int PK
Recruiter_id: int FK NON-CLUSTERED ?
Candidate_id: int FK NON-CLUSTERED ?
ViewsAssigner_Identifier_Id: int FK NON-CLUSTERED ?
DateViewed: date NON-CLUSTERED
I will query a Select of all [Candidate_id] by [ViewsAssigner_Identifier_id]
We want to search by Company not by Recruiter, because all the Recruiters in the same company used the same [Views_Assigned] to the Company. In other words the first Recuiter who views the Candidate is going to be stored in "CandidateViewCounts" Table and the subsequents Recruitres who view the same candidate are not going to be stored.
Result:
I need to retrieve a list of all the [Candidate_Id] by [ViewsAssigner_Identifier_id] and then I can SUM all these Candidates Ids.
Query Example:
SELECT [Candidate_Id] FROM [dbo].[CandidateViewCounts] WHERE [ViewsAssigner_Identifier_id] = 1
Any recommendations?
If you think that each recruiter might view each candidate once, you're talking about a max of 60,000 * 2,000,000 rows. That's a large number, but they aren't very wide rows; as ErikE explained you will be able to get many rows on each page, so the total I/O even for a table scan will not be quite as bad as it sounds.
That said, for maintenance reasons, as long as you don't search by CandidateID, you may want to partition this table on RecruiterID. For example, your partition scheme could have one partition for RecruiterID between 1 and 2000, one partition for 2001 -> 4000, etc. This way you max out the number of rows per partition and can plan file space accordingly (you can put each partition on its own filegroup, separating I/O).
Another point is this: if you are looking to run queries such as "how many views on this candidate (and we don't care which recruiters)?" or "how many candidates has this recruiter viewed (and we don't care which candidates)?" then you may consider indexed views. E.g.
CREATE VIEW dbo.RecruiterViewCounts
WITH SCHEMABINDING
AS
SELECT RecruiterID, COUNT_BIG(*)
FROM dbo.tablename;
GO
CREATE UNIQUE CLUSTERED INDEX pk_rvc ON dbo.RecruiterViewCounts(RecruiterID);
GO
CREATE VIEW dbo.CandidateViewCounts
WITH SCHEMABINDING
AS
SELECT CandidateID, COUNT_BIG(*)
FROM dbo.tablename;
GO
CREATE UNIQUE CLUSTERED INDEX pk_cvc ON dbo.CandidateViewCounts(CandidateID);
GO
Now, these clustered indexes are expensive to maintain, so you'll want to test your write workload against them. But they should make those two queries extremely, extremely fast without having to seek into your large table and potentially read multiple pages for a very busy recruiter or a very popular candidate.
If your table is clustered on the RecruiterID you will have a very fast seek and in my opinion no performance issue at all.
In such a narrow table as you've described, finding out the profiles viewed for any one recruiter should require a single read 99+% of the time. (Assume fillfactor = 80 with minimal page splits; row width assuming two int columns = 16 bytes + overhead, call that 20 bytes; 8040 or so bytes per page; say they get 4 views at average 2.5 rows per recruiter = ballpark 128 recruiters per data page). The total number of rows in the table is irrelevant because it can seek into the clustered index. Yeah, it has to traverse the tree, but it is still going to be very fast. There is no better way so long as the views have to be counted once per candidate. If it were simply total views, you could keep a count instead.
I don't think you have much to worry about. If you are concerned that the system could grow to tens of thousands of request per second and you'll get some kind of limiting hotspot of activity, as long as the recruiters visiting at any one point in time do not coincidentally have sequential IDs assigned to them, you will be okay.
The big principle here is that you want to avoid anything that would have to scan the table top to bottom. You can avoid that as long as you always search by RecruiterID or RecruiterID, CandidateID. The moment you want to search by CandidateID alone, you will be in trouble without an additional index. Adding a nonclustered index on CandidateID will double the space your table takes (half for the clustered, half for the nonclustered) but that is no big deal. Then searching by CandidateID will be just as fast, because the nonclustered index will properly cover the query and no bookmark lookup will be required.
Update
This is a response to the substantially new information you provided in the update to your question.
First, your CandidateViewCounts table is named incorrectly. It's something more like CandidateFirstViewedByRecruiterAtCompany. It can only indirectly answer the question you have, which is about the Company, not the Recruiters, so in my opinion the scenario you're describing really calls for a CompanyCandidateViewed table:
CompanyID int FK
CandidateID int FK
PRIMARY KEY CLUSTERED (CompanyID, CandidateID)
Store the CompanyID of the recruiter who viewed the candidate, and the CandidateID. Simple! Now my original answer still works for you, simply swap RecruiterID with CompanyID.
If you really do want to track which recruiters viewed which candidates, then do so in a RecruiterCandidateViewed table (and store all recruiter->candidate views). That can be queried later or in a data warehouse. But your real-time OLTP needs will be met by the table described above.
Also, I would like to mention that it is possible you are putting identity columns in tables that that don't need them. You should avoid identity columns unless the column is going to be used as an FK in another table (and not always even then, as sometimes in proper data modeling in order to prevent possible denormalization you must use composite keys in FKs). For example, your ViewsAssigner_Identifier table seems to me to need some help (of course I don't have all the information here and could be off base). If the Company and the Date are what's most important about that table, make them together the clustered PK and get rid of the identity column if at all possible.
I've been researching best practices for creating clustered indexes and I'm just trying to totally understand these two suggestions that's listed with pretty much every BLOG or article on the matter
Columns that contain a large number of distinct values.
Queries that return large result sets.
These seem to be slightly contrary or I'm guessing maybe it just depends on how you're accessing the table.. Or my interpretation of what "large result sets" mean is wrong....
Unless you're doing range queries over the clustered column it seems like you typically won't be getting large result sets that matter. So in cases where SQL Server defaults the clustered indexes on the PK you're rarely going to fulfill the large result set suggestion but of course it does the large number of distinct values..
To give the question a little more context. This quetion stems from a vertical auditing table we have that has a column for TABLE.... Every single query that's written against this table has a
WHERE TABLE = 'TABLENAME'
But the TableName is highly non distinct... Each result set of tablenames is rather large which seems to fulfill that second conditon but it's definitely not largerly unique.... Which means all that other stuff happens with having to add the 4 byte Uniquifer (sp?) which makes the table a lot larger etc...
This situation has come up a few times for me when I've come upon DBs that have say all the contact or some accounts normalized into a single table and they are only separated by a TYPE parameter. Which is on every query....
In the case of the audit table the queries are typically not that exciting either they are just sorted by date modified, sometimes filtered by column, user that made the change etc...
My other thought with this auditing scenario was to just make the auditing table a HEAP so that inserting is fast so there's not contention between tables being audited and then to generate indexed views over the data ...
Index design is just as much art as it is science.
There are many things to consider, including:
How the table will be accessed most often: mostly inserts? any updates? more SELECTs than DML statements? Any audit table will likely have mostly inserts, no updates, rarely deletes unless there is a time-limit on the data, and some SELECTs.
For Clustered indexes, keep in mind that the data in each column of the clustered index will be copied into each non-clustered index (though not for UNIQUE indexes, I believe). This is helpful as those values are available to queries using the non-clustered index for covering, etc. But it also means that the physical space taken up by the non-clustered indexes will be that much larger.
Clustered indexes generally should either be declared with the UNIQUE keyword or be the Primary Key (though there are exceptions, of course). A non-unique clustered index will have a hidden 4-byte field called a uniqueifier that is required to make each row with a non-unique key value addressable, and is just wasted space given that the order of your rows within the non-unique groupings is not apparently obvious so trying to narrow down to a single row is still a range.
As is mentioned everywhere, the clustered index is the physical ordering of the data so you want to cater to what needs the best I/O. This relates also to the point directly above where non-unique clustered indexes have an order but if the data is truly non-unique (as opposed to unique data but missing the UNIQUE keyword when the index was created) then you miss out on a lot of the benefit of having the data physically ordered.
Regardless of any information or theory, TEST TEST TEST. There are many more factors involved that pertain to your specific situation.
So, you mentioned having a Date field as well as the TableName. If the combination of the Date and TableName is unique then those should be used as a composite key on a PK or UNIQUE CLUSTERED index. If they are not then find another field that creates the uniqueness, such as UserIDModified.
While most recommendations are to have the most unique field as the first one (due to statistics being only on the first field), this doesn't hold true for all situations. Given that all of your queries are by TableName, I would opt for putting that field first to make use of the physical ordering of the data. This way SQL Server can read more relevant data per read without having to seek to other locations on disk. You would likely also being ordering on the Date so I would put that field second. Putting TableName first will cause higher fragmentation across INSERTs than putting the Date first, but upon an index rebuild the data access will be faster as the data is already both grouped ( TableName ) and ordered ( Date ) as the queries expect. If you put Date first then the data is still ordered properly but the rows needed to satisfy the query are likely spread out across the datafile(s) which would require more I/O to get. AND, more data pages to satisfy the same query means more pages in the Buffer Pool, potentially pushing out other pages and reducing Page Life Expectancy (PLE). Also, you would then really need to inculde the Date field in all queries as any queries using only TableName (and possibly other filters but NOT using the Date field) will have to scan the clustered index or force you to create a nonclustered index with TableName being first.
I would be weary of the Heap plus Indexed View model. Yes, it might be optimized for the inserts but the system still needs to maintain the data in the indexed view across all DML statements against the heap. Again you would need to test, but I don't see that being materially better than a good choice of fields for a clustered index on the audit table.
I'm working in SQL Server 2008 R2
As part of a complete schema rebuild, I am creating a table that will be used to store advertising campaign performance by zipcode by day. The table setup I'm thinking of is something like this:
CREATE TABLE [dbo].[Zip_Perf_by_Day] (
[CampaignID] int NOT NULL,
[ZipCode] int NOT NULL,
[ReportDate] date NOT NULL,
[PerformanceMetric1] int NOT NULL,
[PerformanceMetric2] int NOT NULL,
[PerformanceMetric3] int NOT NULL,
and so on... )
Now the combination of CampaignID, ZipCode, and ReportDate is a perfect natural key, they uniquely identify a single entity, and there shouldn't be 2 records for the same combination of values. Also, almost all of my queries to this table are going to be filtered on 1 or more of these 3 columns. However, when thinking about my clustered index for this table, I run into a problem. These 3 columns do not increment over time. ReportDate is OK, but CampaignID and Zipcode are going to be all over the place while inserting rows. I can't even order them ahead of time because results come in from different sources during the day, so data for CampaignID 50000 might be inserted at 10am, and CampaignID 30000 might come in at 2pm. If I use the PK as my clustered index, I'm going to run into fragmentation problems.
So I was thinking that I need an Identity ID column, let's call it PerformanceID. I can see no case where I would ever use PerformanceID in either the select list or where clause of any query. Should I use PerformanceID as my PK and clustered index, and then set up a unique constraint and non-clustered indexes on CampaignID, ZipCode, and ReportDate? Should I keep those 3 columns as my PK and just have my clustered index on PerformanceID? (<- This is the option I'm leaning towards right now) Is it OK to have a slightly fragmented table? Is there another option I haven't considered? I am looking for what would give me the best read performance, while not completely destroying write performance.
Some actual usage information. This table will get written to in batches. Feeds come in at various times during the day, they get processed, and this table gets written to. It's going to get heavily read, as by-day performance is important around here. When I fill this table, it should have about 5 million rows, and will grow at a pace of about 8,000 - 10,000 rows per day.
In my experience, you probably do want to use another INT Identity field as your clustered index key. I would also add a UNIQUE constraint to that one (it helps with execution plans).
A big part of the reason is space - if you use a 3 field key for your clustered index, you will have all 3 fields in every row of every non-clustered index on that table (as your clustered index row identifier). If you only plan to have a couple of indexes that isn't a big deal, but if you have a lot of them it can make a big difference. The more data per row, the more pages needed and the more IO you have.
Fragmentation is a very real issue that can cause major performance problems, especially as the table grows.
Having that additional cluster key will also mean writes will be faster for your inserts. All new rows will go to the end of your table, which means existing rows won't be touched or rearranged.
If you want to use those three fields as a FK in other tables, then by all means have them as your PK.
For the most part it doesn't really matter if you ever directly reference your clustered index key. As long as it is narrow, increasing, and unique you should be in good shape.
EDIT:
As Damien points out in the comments, if you will be filtering on single fields of your PK, you will need to have an index on each one (or always use the first field in the covering index).
On the information given (ReportDate, CampaignID, ZipCode) or (ReportDate, ZipCode, CampaignID) seem like better candidates for the clustered index than a surrogate key. Defragmentation would be a potential concern if the time taken to rebuild indexes became prohibitive but given the sizes I would expect for this table (10s or 1000s rather than 1,000,000s of rows per day) that seems unlikely to be an issue.
If I understood all you have written correctly you are opting out of natural clustering due to fragmentation penalties.
For this purpose you consider meaningless IDs which will:
avoid insert penalties for clustered index when inserting out of order batches (great for write performance)
guarantee that your data is fragmented for reads that put conditions on the natural key (not so good for read performance)
JNK point's out that fragmentation can be a real issue, however you need to establish a baseline against which you will measure and you need to establish if reading or writing is more important to you (or how important they are in measurable terms).
There's nothing that will beat a good test case - so finally that is the best recommendation I can give.
With databases it is often relatively easy to build scripts that will create real benchmarks with real workloads and realistic data quantities.
The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect statistics of usage (queries run), look at the engine own statistics like sys.dm_db_index_usage_stats and then you make an informed decision: the partition that bests balances data size and gives best affinity for the most often run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index (where 'table' = one of the indexes), not per table, so the question is not what to partition on, but which indexes to partition or not and what partitioning function to use. Your clustered indexes on the two tables are going to be the most likely candidates obviously (not much sense to partition just a non-clustered index and not partition the clustered one) so, unless you're considering redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I'd venture a guess I'd say that for any data that accumulates over time (like 'cases' with a 'year') the most natural partition is the sliding window.
If you have no other choice you can partition by key module the number of partition tables.
Lets say that you want to partition to 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition you data by UniqueIdentifier or PrimaryKey module 10 and place each record in the corresponding table (Depending on your unique UniqueIdentifier you might need to start manual allocation of ids).
When performing a query, you will need to run same query on all tables, and use UNION to merge the result set into a single query result.
It's not as good as partitioning the tables based on some logical separation which corresponds to the expected query, but it's better then hitting the size limit of a table.
Another possible thing to look at (before partitioning) is your model.
Are you in a normalized database? Are there further steps which could improve performance by different choices in the normalization/de-/partial-normalization? Are there options to transform the data into a Kimball-style dimensional star model which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.