I have a large table consisting of 4 billion+ rows and 50 columns, most of which are either datetime or numeric, with a few varchar columns.
Data will be inserted into the table on a weekly basis (about 20 million rows).
I expect queries with where clauses on some of the datetime columns, and a couple of the varchar columns. There is no primary key in the table.
There are no indexes, nor is the table partitioned. I am using SQL Server 2016.
I understand that I need to partition or index the table, but I am not sure which approach to take, or whether I need both.
Since the table is large, should I create the indexes first or should I create the partitions first? If I do create the indexes and then create the partitions, what should I do to maintain these with new data coming in weekly?
EDIT: Also, minimal updates and deletes are expected on the table
I understand that I need to partition or index the table
You need to understand what you gain from partitioning. It is not at all the case that SQL Server requires partitioning on big tables to function adequately. SQL Server scales to arbitrary table sizes without any inherent issues.
Common benefits of partitioning are:
Mass deletion in constant time
Different storage for older partitions
Not backing up old partitions
Sometimes in special situations (e.g. columnstore), partitioning can help as a strategy to speed up queries. Normally, indexing is better for that.
Essentially, partitioning splits the table physically into multiple sub tables. Most often this has a negative effect on query plans. Indexes are perfectly capable of restricting the set of data that needs to be touched. Partitions are worse for that.
Most of the queries will be filtering on the datetime columns and on some of the varchar columns, e.g. get data for a certain date range for a certain entity. With indexes, there will be a lot of fragmentation because of the new inserts, and rebuilding/reorganising the indexes will also consume a lot of time. I can do it, but again I am not sure which approach to take.
It seems you can best solve this by indexing:
Index according to the queries you expect.
Maintain the indexes properly. This is not too hard. For example, rebuild them after the weekly load.
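For example, the post-load maintenance step could be as simple as the following sketch (dbo.BigTable is a placeholder name):

    -- Rebuild all indexes on the table after the weekly load has completed.
    -- On Enterprise Edition, ONLINE = ON can be added to keep the table available.
    ALTER INDEX ALL ON dbo.BigTable
    REBUILD WITH (SORT_IN_TEMPDB = ON);

    -- Index statistics are refreshed by the rebuild; column statistics can be updated separately.
    UPDATE STATISTICS dbo.BigTable WITH COLUMNS;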
Since the table is large, should I create the indexes first or should I create the partitions first?
Set up the partitioning objects first. Then, create or rebuild the clustered index on the new partition scheme. If possible, drop the other indexes first and recreate them afterwards (this might not work due to availability restrictions).
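A rough sketch of that sequence, with made-up function, scheme, table, and column names (adjust the boundary values and filegroups to your data):

    -- 1. Partitioning objects first: a RANGE RIGHT function on the main datetime
    --    filter column, and a scheme mapping every partition to one filegroup.
    CREATE PARTITION FUNCTION pf_EventDate (datetime)
        AS RANGE RIGHT FOR VALUES ('2016-01-01', '2016-02-01', '2016-03-01');

    CREATE PARTITION SCHEME ps_EventDate
        AS PARTITION pf_EventDate ALL TO ([PRIMARY]);

    -- 2. Create (or rebuild) the clustered index on the new partition scheme.
    --    Add WITH (DROP_EXISTING = ON) if a clustered index of this name already exists.
    CREATE CLUSTERED INDEX CIX_BigTable_EventDate
        ON dbo.BigTable (EventDate)
        WITH (SORT_IN_TEMPDB = ON)
        ON ps_EventDate (EventDate);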
what should I do to maintain these with new data coming in weekly.
What concerns do you have? New data will be stored in the appropriate partitions automatically. Make sure to create new partitions before loading the data. Keep partitions ready for 2 weeks in advance. The latest partitions must always be empty to avoid costly splits.
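Using the hypothetical function and scheme from the sketch above, adding the next empty partition ahead of a load looks roughly like this:

    -- Tell the scheme which filegroup the next partition should use,
    -- then split a new, still-empty boundary at the right-hand end of the range.
    ALTER PARTITION SCHEME ps_EventDate NEXT USED [PRIMARY];

    ALTER PARTITION FUNCTION pf_EventDate()
        SPLIT RANGE ('2016-04-01');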
There is no primary key in the table.
Most often this is not a good design. Most tables should have a primary key and a clustered index. If there is no natural key, use an artificial one such as a bigint identity.
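If you decide to add a surrogate key, a sketch could look like the following (names assumed; on a table this size both statements are expensive, so test first):

    -- Add the surrogate key column (a heavy, fully logged operation on a large table).
    ALTER TABLE dbo.BigTable
        ADD Id bigint IDENTITY(1,1) NOT NULL;

    -- Enforce it as the primary key. Here it is nonclustered so the clustered index
    -- can stay on the datetime column used for filtering/partitioning.
    ALTER TABLE dbo.BigTable
        ADD CONSTRAINT PK_BigTable PRIMARY KEY NONCLUSTERED (Id) ON [PRIMARY];
    -- Note: on a partitioned table, a unique index that does not contain the
    -- partitioning column must be non-aligned (hence ON [PRIMARY]), which in turn
    -- blocks partition switching on the table.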
You definitely can apply partitioning, but my feeling is that it will not gain you what you may be expecting. It will, however, force you to take on additional maintenance burdens, possibly reduce performance, and introduce the risk of mistakes that threaten availability. Simplicity is important.
I have a table in a production server with 350 million rows and approximately 25 GB in size. It has a single clustered identity index.
The queries targeting this table require some missing indexes for better performance.
I need to delete unnecessary data (approximately 200 million rows) and then create two non-clustered indexes.
However, I have some concerns:
I need to avoid increasing the log too much
Keep the database downtime as low as possible.
Keep the identity (primary key) the same in the remaining data.
I would like to hear your opinion on the best solution to adopt.
The following is a guideline on how you might do this:
Suspend insert/update operations or start logging them explicitly (this might result in degraded performance).
Select the records to keep into a new table.
Then you have two options. If this is the only table in your universe:
Build the indexes on the new table.
Stop the system.
Rename the existing table to something else.
Rename the new table to the real table name
Turn the system back on.
If there are other tables (such as foreign key relationships):
Truncate the existing table
Insert the data into the existing table
Build the secondary indexes
Turn the system back on
Depending on your user requirements, one of the above variations is likely to work for your problem.
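For the stand-alone-table variation, the copy-and-rename steps might look like the following sketch (all names and the KeepFlag predicate are invented; minimal logging of SELECT ... INTO depends on your recovery model):

    -- Copy only the rows to keep. Under the SIMPLE or BULK_LOGGED recovery model
    -- SELECT ... INTO is minimally logged, which keeps log growth down.
    -- A plain single-table SELECT also preserves the IDENTITY property and values.
    SELECT *
    INTO   dbo.BigTable_new
    FROM   dbo.BigTable
    WHERE  KeepFlag = 1;   -- whatever predicate identifies the rows to keep

    -- Build the indexes on the new table while the old one still serves reads.
    CREATE UNIQUE CLUSTERED INDEX CIX_BigTable_Id ON dbo.BigTable_new (Id);
    CREATE NONCLUSTERED INDEX IX_BigTable_Col1 ON dbo.BigTable_new (Col1);
    CREATE NONCLUSTERED INDEX IX_BigTable_Col2 ON dbo.BigTable_new (Col2);

    -- Short outage: swap the tables by renaming.
    EXEC sp_rename 'dbo.BigTable',     'BigTable_old';
    EXEC sp_rename 'dbo.BigTable_new', 'BigTable';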
Note that there are other, more involved techniques. For instance, create a replicated database and, once that is working, you have two systems and can do the clean-up work on one at a time (a method such as this would be the preferred method for a system with near-100% uptime requirements). Or create a separate table that is just right and swap the table spaces.
I've been researching best practices for creating clustered indexes and I'm just trying to fully understand these two suggestions that are listed in pretty much every blog or article on the matter:
Columns that contain a large number of distinct values.
Queries that return large result sets.
These seem to be slightly contradictory, or I'm guessing it just depends on how you're accessing the table... or my interpretation of what "large result sets" means is wrong...
Unless you're doing range queries over the clustered column, it seems like you typically won't be getting large result sets that matter. So in cases where SQL Server defaults the clustered index to the PK, you're rarely going to fulfill the large-result-set suggestion, though of course it does satisfy the large-number-of-distinct-values one.
To give the question a little more context: this question stems from a vertical auditing table we have that has a column for TABLE... Every single query that's written against this table has a
WHERE TABLE = 'TABLENAME'
But the TableName is highly non-distinct... Each result set of table names is rather large, which seems to fulfill that second condition, but it's definitely not largely unique... which means all that other stuff happens with having to add the 4-byte uniqueifier, which makes the table a lot larger, etc.
This situation has come up a few times for me when I've come upon DBs that have, say, all the contacts or accounts normalized into a single table, separated only by a TYPE parameter that is on every query...
In the case of the audit table the queries are typically not that exciting either: they are just sorted by date modified, sometimes filtered by column, by the user that made the change, etc.
My other thought with this auditing scenario was to just make the auditing table a heap so that inserting is fast and there's no contention between the tables being audited, and then to generate indexed views over the data...
Index design is just as much art as it is science.
There are many things to consider, including:
How the table will be accessed most often: mostly inserts? any updates? more SELECTs than DML statements? Any audit table will likely have mostly inserts, no updates, rarely deletes unless there is a time-limit on the data, and some SELECTs.
For Clustered indexes, keep in mind that the data in each column of the clustered index will be copied into each non-clustered index (though not for UNIQUE indexes, I believe). This is helpful as those values are available to queries using the non-clustered index for covering, etc. But it also means that the physical space taken up by the non-clustered indexes will be that much larger.
Clustered indexes generally should either be declared with the UNIQUE keyword or be the Primary Key (though there are exceptions, of course). A non-unique clustered index will have a hidden 4-byte field called a uniqueifier that is required to make each row with a non-unique key value addressable, and it is largely wasted space given that the order of your rows within the non-unique groupings is not apparent, so trying to narrow down to a single row is still a range operation.
As is mentioned everywhere, the clustered index is the physical ordering of the data so you want to cater to what needs the best I/O. This relates also to the point directly above where non-unique clustered indexes have an order but if the data is truly non-unique (as opposed to unique data but missing the UNIQUE keyword when the index was created) then you miss out on a lot of the benefit of having the data physically ordered.
Regardless of any information or theory, TEST TEST TEST. There are many more factors involved that pertain to your specific situation.
So, you mentioned having a Date field as well as the TableName. If the combination of the Date and TableName is unique then those should be used as a composite key on a PK or UNIQUE CLUSTERED index. If they are not then find another field that creates the uniqueness, such as UserIDModified.
While most recommendations are to have the most unique field as the first one (due to statistics being kept only on the first field), this doesn't hold true for all situations. Given that all of your queries are by TableName, I would opt for putting that field first to make use of the physical ordering of the data. This way SQL Server can read more relevant data per read without having to seek to other locations on disk. You would likely also be ordering on the Date, so I would put that field second. Putting TableName first will cause higher fragmentation across INSERTs than putting the Date first, but after an index rebuild the data access will be faster, as the data is already both grouped ( TableName ) and ordered ( Date ) as the queries expect. If you put Date first then the data is still ordered properly, but the rows needed to satisfy the query are likely spread out across the data file(s), which requires more I/O to get. And more data pages to satisfy the same query means more pages in the Buffer Pool, potentially pushing out other pages and reducing Page Life Expectancy (PLE). Also, you would then really need to include the Date field in all queries, as any queries using only TableName (and possibly other filters but NOT the Date field) will have to scan the clustered index or force you to create a nonclustered index with TableName first.
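Illustratively (table and column names guessed from the description), that clustered index could be declared like this, with a third column appended only to make the key unique and avoid the uniqueifier:

    CREATE UNIQUE CLUSTERED INDEX CIX_AuditLog
        ON dbo.AuditLog (TableName, DateModified, AuditId);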
I would be wary of the heap plus indexed view model. Yes, it might be optimized for the inserts, but the system still needs to maintain the data in the indexed view across all DML statements against the heap. Again, you would need to test, but I don't see that being materially better than a good choice of fields for a clustered index on the audit table.
The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect statistics of usage (queries run), look at the engine's own statistics like sys.dm_db_index_usage_stats, and then make an informed decision: the partitioning that best balances data size and gives the best affinity for the most often-run queries will be a good candidate. Of course, you'll have to compromise.
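For example, a first look at how the existing indexes on those two tables are actually being used could start with something like this (standard DMV; counters reset when the instance restarts):

    SELECT  OBJECT_NAME(s.object_id)                    AS table_name,
            i.name                                      AS index_name,
            s.user_seeks, s.user_scans, s.user_lookups, -- read activity
            s.user_updates                              -- maintenance cost
    FROM    sys.dm_db_index_usage_stats AS s
    JOIN    sys.indexes                 AS i
            ON  i.object_id = s.object_id
            AND i.index_id  = s.index_id
    WHERE   s.database_id = DB_ID()
      AND   OBJECT_NAME(s.object_id) IN ('Case', 'Case_Participant')
    ORDER BY s.user_seeks + s.user_scans + s.user_lookups DESC;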
Also don't forget that partitioning is per index (where the 'table' itself is just one of the indexes), not per table, so the question is not what to partition on, but which indexes to partition or not and what partitioning function to use. Your clustered indexes on the two tables are obviously going to be the most likely candidates (there is not much sense in partitioning just a non-clustered index and not partitioning the clustered one), so, unless you're considering a redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I were to venture a guess, I'd say that for any data that accumulates over time (like 'cases' with a 'year') the most natural partition is the sliding window.
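A sliding window on Year would then revolve around two recurring operations, roughly like this sketch (function, scheme, table, and boundary values are all assumptions):

    -- Before next year's data arrives: add a new, still-empty partition.
    ALTER PARTITION SCHEME ps_CaseYear NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pf_CaseYear() SPLIT RANGE (2015);

    -- To age out the oldest year: switch that partition into an empty staging table
    -- with the same structure (a metadata-only operation), then remove the boundary.
    ALTER TABLE dbo.[Case] SWITCH PARTITION 1 TO dbo.Case_Archive;
    ALTER PARTITION FUNCTION pf_CaseYear() MERGE RANGE (2005);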
If you have no other choice you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
Then partition your data by UniqueIdentifier or PrimaryKey modulo 10 and place each record in the corresponding table (depending on how your UniqueIdentifier values are generated, you might need to switch to manual allocation of IDs).
When performing a query, you will need to run the same query on all tables and use UNION to merge the result sets into a single query result.
It's not as good as partitioning the tables based on some logical separation which corresponds to the expected query, but it's better than hitting the size limit of a table.
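A view over the hypothetical Case00-Case09 tables can hide the UNION from callers; a minimal sketch (only two member tables shown):

    -- Each CaseNN table holds the rows where PrimaryKey % 10 = NN; a CHECK constraint
    -- on each table documents the split and lets the optimizer skip member tables
    -- when a query filters on PrimaryKey.
    CREATE VIEW dbo.CaseAll
    AS
    SELECT * FROM dbo.Case00
    UNION ALL
    SELECT * FROM dbo.Case01
    /* ... UNION ALL dbo.Case02 through dbo.Case09 ... */;
    GO

    -- Callers query the view as if it were a single table.
    SELECT * FROM dbo.CaseAll WHERE [Year] = 2012 AND Status = 'Open';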
Another possible thing to look at (before partitioning) is your model.
Are you in a normalized database? Are there further steps that could improve performance through different choices in normalization, denormalization, or partial normalization? Are there options to transform the data into a Kimball-style dimensional star model, which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.
If I have a lookup table with very few records in it (say, less than ten), should I bother putting an index on the Foreign Key of another table to which it is attached? For that matter, does the lookup table even need an index on the Primary Key?
Specifically, is there any performance benefit that outweighs the overhead of maintaining the indexes? If not, are there any benefits other than speed?
Note: an example of a lookup table might be Order Status, where the tuples are:
1 - Order Received
2 - In Process
3 - Shipped
4 - Paid
On a transactional system there may be no significant benefit to putting an index on such a column (i.e. a low-cardinality reference column) as the query optimiser probably won't use it. It will also generate additional disk traffic on writes to the table as the indexes have to be updated. So for low-cardinality FKs on a transactional database it is usually better not to index the columns. This particularly applies to high-volume systems.
Note that you may still want the FK for referential integrity and that the FK lookup on a small reference table will probably generate no I/O as the lookup table will almost always be cached.
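As a concrete sketch (invented names), the constraint itself only requires a unique index on the lookup side; whether to index the FK column on the big table stays a separate decision:

    CREATE TABLE dbo.OrderStatus
    (
        OrderStatusId tinyint     NOT NULL PRIMARY KEY,   -- the unique index the FK needs
        StatusName    varchar(30) NOT NULL
    );

    CREATE TABLE dbo.Orders
    (
        OrderId       int     IDENTITY(1,1) NOT NULL PRIMARY KEY,
        OrderStatusId tinyint NOT NULL
            CONSTRAINT FK_Orders_OrderStatus
            REFERENCES dbo.OrderStatus (OrderStatusId)
        -- No index on Orders.OrderStatusId here; add one only if the optimizer
        -- would use it or if OrderStatus rows are ever deleted or updated.
    );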
However, you may find that you want to include the column in a composite index for some reason - perhaps to create a covering index for a commonly used query.
On a table that is frequently bulk-loaded (e.g. a data warehouse) the index write traffic will be much larger than that of the table load if you have many indexed columns. You will probably need to drop or disable the FKs and indexes for a bulk load if any indexes are present.
On a Star Schema you can get some benefit from indexing low cardinality columns, even on SQL Server. If you are doing a highly selective query (i.e. one where the query optimiser decides that the row set returned will be small) then it can do a 'star query' plan where it uses a technique known as index intersection.
Generally, query plans on a star schema should be based around a table scan of the fact table or a highly selective process that bookmarks the fact table and then returns a smaller set of rows. Index intersection is efficient for the latter type of query as the selection can be resolved before doing any I/O on the fact table.
Bitmap indexes are a real win for low cardinality columns on platforms such as Oracle that support them, but SQL Server does not. Even so, low cardinality indexes can still participate in star query plans on SQL Server.
Yes, always have an index.
The query optimizer of a modern database management system (DBMS) will make the determination as to which is faster: (1) actually reading from an index on a column, or (2) performing a full table scan.
The table size (in number of rows) needs to be "large enough" for use of the index to be considered.
Yes to both. Always index as a rule of thumb.
Points:
You also can't set up an FK without a unique index on the lookup table
What if you want to delete or update in the lookup table? Especially accidentally...
However, having said that, we don't always.
We have a very busy OLTP table (5 million+ rows per day) with several parent tables. We only have indexes on the FK columns where we need them. We assume no deletes/key updates on some parent tables, so we reduce the amount of work needed and the disk space used.
We used the SQL Server 2005 DMVs to establish that the indexes weren't used. We still have the FKs in place, though.
My personal opinion is that you should... it may be small now but ALWAYS anticipate your tables growing in size. A good database schema will grow easily with more records. Foreign Keys are almost always a good idea.
In SQL Server, the primary key becomes the clustered index if there isn't one already (a clustered index, that is).
Suppose you have a very large database, and to simplify let's say it consists of one major table you will be doing your lookups on, with one (and only one) primary key field - pk.
Given the fact that all lookups are going to be basically SELECT * FROM table_name WHERE pk=someKeyValue, what is the best way to optimize this database for the fastest lookups?
Edit: just a few more details - INSERTs and UPDATEs are going to be very non-frequent so I don't mind sacrificing performance there to achieve better lookup performance.
Also, seems like clustering is the way to go. Do you have any examples of the kind of increase in performance I can achieve with this method? And how exactly is this done (on any kind of DB)?
If the primary key is clustered, then you won't get any quicker than that.
If it isn't clustered, and the number of columns in your table is relatively small, then you could in theory create a covering index to speed up the query. But then this negates any insert/update performance enhancements that having the non-clustered primary key would have given you.
If your primary key is an always-increasing field (e.g. a SQL Server identity, or generated from a sequence in Oracle) then the clustered primary key has no drawbacks anyway.
One thing you could do is make the primary key clustered; this results in the actual data being physically ordered on the disk, resulting in faster queries.
It will also mean slower inserts, but if you select much more frequently than you insert, this should not be a problem.
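In SQL Server syntax, for example, that just means declaring the key as the clustered primary key (table and column names assumed):

    -- Physically order the table by pk so each lookup is a single clustered index seek.
    CREATE TABLE dbo.BigLookup
    (
        pk   bigint       NOT NULL,
        data varchar(200) NOT NULL,
        CONSTRAINT PK_BigLookup PRIMARY KEY CLUSTERED (pk)
    );

    -- The typical lookup then seeks directly on the clustered key.
    SELECT * FROM dbo.BigLookup WHERE pk = 42;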
If you're using MySQL, you can do some additional things (beyond tuning your cache values). The table engine can be a factor; for instance, MyISAM is widely held to be faster at SELECTs than InnoDB. If this table is primarily a lookup table, and you were using MySQL, that might be a good thing to do. (InnoDB is pretty good on average; it's better on writes than MyISAM, and also, InnoDB never needs to be repaired.)
I have to add two more options to all that was proposed above (I like dwc’s answer). You should consider partitioning if your table is really big.
First, horizontal partitioning (especially if I/O is the bottleneck in your DB). You create several filegroups and locate them on different hard drives. Then, create a partition function and a partition scheme to divide your table and put parts of your table on separate HDs (like rows 1-499999 on the F: drive, 500000-999999 on the G: drive, and so on).
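In SQL Server that could be sketched like this (the filegroups are assumed to exist already on the separate drives; all names and boundaries are illustrative):

    -- Map bands of rows to different filegroups/drives.
    CREATE PARTITION FUNCTION pf_RowBand (bigint)
        AS RANGE LEFT FOR VALUES (499999, 999999);

    CREATE PARTITION SCHEME ps_RowBand
        AS PARTITION pf_RowBand TO ([FG_DriveF], [FG_DriveG], [FG_DriveH]);

    -- Rebuild the clustered primary key onto the scheme so the data is redistributed.
    CREATE UNIQUE CLUSTERED INDEX PK_BigLookup
        ON dbo.BigLookup (pk)
        WITH (DROP_EXISTING = ON)
        ON ps_RowBand (pk);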
Second, vertical partitioning. This would work if you select column sets (not *) in most of your queries. In this case, divide the columns in the table into two groups: first, the fields that you need in all queries; second, the fields that you rarely need. Create two tables with the same primary key. Use JOINs on the primary key when you need columns from both tables.
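And a sketch of the vertical split (column groups invented):

    -- Hot columns needed by nearly every query.
    CREATE TABLE dbo.BigLookup_Core
    (
        pk     bigint       NOT NULL PRIMARY KEY CLUSTERED,
        status tinyint      NOT NULL,
        name   varchar(100) NOT NULL
    );

    -- Rarely used, wide columns live in a second table sharing the same key.
    CREATE TABLE dbo.BigLookup_Extra
    (
        pk    bigint        NOT NULL PRIMARY KEY CLUSTERED
              REFERENCES dbo.BigLookup_Core (pk),
        notes varchar(8000) NULL
    );

    -- Join only when the rarely used columns are actually needed.
    SELECT c.pk, c.name, e.notes
    FROM   dbo.BigLookup_Core  AS c
    JOIN   dbo.BigLookup_Extra AS e ON e.pk = c.pk
    WHERE  c.pk = 42;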
(This answer pertains to SQL Server 2005/2008.)
If all your queries are going to be based off the PK, you wouldn't get any added benefit from setting an index on the PK, since it is already indexed on that.
Edit: The only other thing I would suggest is looking at normalizing your table (if that is even an option or a necessity). By splitting items off into other tables, you can refine what is being pulled back in each query and only pull the less-used items when needed, using joins.
Based off the limited description of "a very large database with a single table" it is hard to locate any easy and obvious ways to optimize without looking at what kind of data you are actually storing in your fields.
If your PK order matches insertion order, i.e. time or id/autoincrement, then make it clustered. This will reduce disk and cache thrashing on inserts, leaving more resources to devote to lookups.
Consider tweaking page sizes on the table to be an exact multiple of your record size. This requires intimate knowledge of the particular database software for details of how, and record/index overhead, etc.
If practical, use fixed-size for all columns rather than variable size.
Consider putting the index and/or transaction log files on a separate volume.
Install as much RAM as the software and hardware can use.
If you were using Oracle then I'd advise benchmarking three approaches:
Heap table with primary key index
Index-organised table
Single table hash cluster
Option 1 represents a very vanilla approach; really, it's the lowest common denominator, but it could mean 5+ logical reads to get each row, with one of those being a probable physical read of the table if it is not completely cached.
Option 2 will save you one of those logical reads by avoiding the probe to a separate table segment, but might not save you the physical read because the IOT segment will be larger and harder to cache than the index alone.
Option 3 will potentially get you the row with a single logical read, but unless you have the entire table cached that's probably going to translate into a physical read.
Benchmarking is highly recommended.