Does MS SQL Server support table partitioning by List?

I am pretty new to the table partitioning technique supported by MS SQL Server. I have a huge table with more than 40 million records and want to apply table partitioning to it. Most of the examples I find define the partition function as RANGE LEFT|RIGHT FOR VALUES (...), but what I need is something like the following example I found on an Oracle web page:
CREATE TABLE q1_sales_by_region
(...,
...,
...,
state varchar2(2))
PARTITION BY LIST (state)
(PARTITION q1_northwest VALUES ('OR', 'WA'),
PARTITION q1_southwest VALUES ('AZ', 'UT', 'NM'),
PARTITION q1_northeast VALUES ('NY', 'VM', 'NJ'),
PARTITION q1_southeast VALUES ('FL', 'GA'),
PARTITION q1_northcentral VALUES ('SD', 'WI'),
PARTITION q1_southcentral VALUES ('OK', 'TX')
);
The example shows that we can specify a PARTITION BY LIST clause in the CREATE TABLE statement, and the PARTITION clauses specify lists of discrete values that qualify rows to be included in the partition.
My question is: does MS SQL Server support table partitioning by list as well?

It does not. SQL Server's partitioned tables only support range partitioning.
In this circumstance, you may wish instead to consider using a Partitioned View.
There are a number of restrictions (scroll down slightly from the link anchor) that apply to partitioned views, but the key here is that the partitioning is based on CHECK constraints on the underlying tables, and one form the CHECK can take is <col> IN (value_list).
However, setting up partitioned views is considerably more "manual" than creating a partitioned table - each table that holds some of the view data has to be individually and explicitly created.
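A minimal sketch of the partitioned-view approach, assuming a sales table split by state (table and column names here are illustrative, not from the question):
-- Member tables: one per region, each with a CHECK constraint on the
-- partitioning column so the optimizer can eliminate tables at query time.
CREATE TABLE dbo.q1_sales_northwest
(
    sale_id int           NOT NULL,
    amount  decimal(10,2) NOT NULL,
    state   varchar(2)    NOT NULL,
    CONSTRAINT pk_q1_sales_northwest PRIMARY KEY (sale_id, state),
    CONSTRAINT ck_q1_sales_northwest_state CHECK (state IN ('OR', 'WA'))
);
CREATE TABLE dbo.q1_sales_southwest
(
    sale_id int           NOT NULL,
    amount  decimal(10,2) NOT NULL,
    state   varchar(2)    NOT NULL,
    CONSTRAINT pk_q1_sales_southwest PRIMARY KEY (sale_id, state),
    CONSTRAINT ck_q1_sales_southwest_state CHECK (state IN ('AZ', 'UT', 'NM'))
);
GO
-- The view unions the member tables; a query filtering on state only
-- touches the tables whose CHECK constraint can match the predicate.
CREATE VIEW dbo.q1_sales_by_region
AS
SELECT sale_id, amount, state FROM dbo.q1_sales_northwest
UNION ALL
SELECT sale_id, amount, state FROM dbo.q1_sales_southwest;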

You can achieve this by using an auxiliary persisted computed column.
Here you can find a complete example:
LIST Partitioning in SQL Server
The idea is to create a computed column based on your list like this:
alter table q1_sales_by_region add calc_field as (case when state in ('OR', 'WA') then 1 ... end) PERSISTED
Then partition on this calc_field using a standard range partition function.
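A rough sketch of the whole approach, assuming the state-to-region mapping from the Oracle example above (object names are illustrative):
-- Map each state list to a small integer in a persisted computed column,
-- then range-partition on that column.
CREATE PARTITION FUNCTION pf_region (tinyint)
AS RANGE LEFT FOR VALUES (1, 2, 3, 4, 5);   -- regions 1-5 plus an overflow partition

CREATE PARTITION SCHEME ps_region
AS PARTITION pf_region ALL TO ([PRIMARY]);

CREATE TABLE q1_sales_by_region
(
    sale_id int NOT NULL,
    state varchar(2) NOT NULL,
    calc_field AS CAST(CASE
        WHEN state IN ('OR', 'WA')       THEN 1
        WHEN state IN ('AZ', 'UT', 'NM') THEN 2
        WHEN state IN ('NY', 'VM', 'NJ') THEN 3
        WHEN state IN ('FL', 'GA')       THEN 4
        WHEN state IN ('SD', 'WI')       THEN 5
        ELSE 6
    END AS tinyint) PERSISTED NOT NULL
    -- note: any unique or clustered index on the table must include calc_field
) ON ps_region (calc_field);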

What are you trying to accomplish with partitioning? 40M rows was huge 20 years ago but is commonplace nowadays. Index and query tuning is especially important for performance of large tables, although partitioning can improve performance of large scans when the partitioning column is not the leftmost clustered index key column and partitions can be eliminated during query processing.
For improved manageability and control over physical placement on different filegroups, you can use range partitioning with a filegroup per region. For example:
CREATE TABLE q1_sales_by_region
(
--
state char(2)
);
CREATE PARTITION FUNCTION PF_State(char(2)) AS RANGE RIGHT FOR VALUES(
'AZ'
, 'FL'
, 'GA'
, 'NJ'
, 'NM'
, 'NY'
, 'OK'
, 'OR'
, 'SD'
, 'TX'
, 'UT'
, 'VM'
, 'WA'
, 'WI'
);
CREATE PARTITION SCHEME PS_State AS PARTITION PF_State TO(
[PRIMARY] --unused
, q1_southwest --'AZ'
, q1_southeast --'FL'
, q1_southeast --'GA'
, q1_northeast --'NJ'
, q1_southwest --'NM'
, q1_northeast --'NY'
, q1_southcentral --'OK'
, q1_northwest --'OR'
, q1_northcentral --'SD'
, q1_southcentral --'TX'
, q1_southwest --'UT'
, q1_northeast --'VM'
, q1_northwest --'WA'
, q1_northcentral --'WI'
);
You can also add a check constraint if you don't already have a related table to enforce only valid state values:
ALTER TABLE q1_sales_by_region
ADD CONSTRAINT ck_q1_sales_by_region_state
CHECK (state IN('OR', 'WA', 'AZ', 'UT', 'NM','NY', 'VM', 'NJ','FL', 'GA','SD', 'WI','OK', 'TX'));
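The table itself still has to be placed on the scheme; one way (a sketch, with an illustrative index name) is to build a clustered index over the partition scheme, and $PARTITION can then be used to check where a value lands:
-- Move the table onto the partition scheme by building its clustered index there
CREATE CLUSTERED INDEX cix_q1_sales_by_region
ON q1_sales_by_region (state)
ON PS_State (state);

-- Verify which partition a given state maps to
SELECT $PARTITION.PF_State('GA') AS partition_number;  -- 4 => q1_southeast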

Related

group by on clustering key is not reading from metadata

I have defined a cluster key on one of the columns, "time_period". When I use a WHERE clause, the query operates on metadata, which I can see in the history profile of the query below:
select count(*) from table where time_period = 'Jan 2021'
but when I use GROUP BY to get the count for each month, it scans all the partitions:
select time_period , count(*) from table group by time_period
Why is the second query not a metadata operation?
select time_period , count(*) from table group by time_period;
is a full table scan.
select count(*) from table where time_period = 'Jan 2021'
is a full scan of only the partitions where time_period equals one value, so the metadata is searched to find the matching partitions, hence the pruning.
If your table has values from 'Jan 2020' to 'Jan 2021', and assuming those are dates not strings (which would be very bad for performance), and assuming your data is clustered on time_period (or naturally inserted in "months"), then
select time_period, count(*)
from table
where time_period >= '2021-06-01'
group by 1 order by 1;
should only read ~50% of your partitions, as the assumed order of the data means only half the table needs to be read.
Answering the "meta-data" vs "scanning" question. This is based on years of working with query optimization, and is "very well educated speculation".
There is a big difference between "COUNT()" and "COUNT() ... GROUP BY". The latter is much more complex and handles much more complex queries.
Optimizers evolve over time to handle special cases, but they start out focusing on more common types of queries.
The non-GROUP query against a non-keyed but well-clustered table can be answered from the metadata without a scan. It's a specialized but meaningful optimization for a special case.
But the same specialization is not present in the GROUP BY, which addresses a much broader class of queries, with GROUP BY and WHERE clauses over multiple non-cluster-key columns.
The COUNT() GROUP BY would need to add a special check for this particular query form; once anything else is added, the meta-data would not be sufficient.
So there is no specialized optimization for this specific case of COUNT() ... GROUP BY.

Option to use a wild card in 'CREATE PARTITION FUNCTION' for datetime columns?

I want to partition the first five days of the month.
Following is the way I achieved it.
CREATE PARTITION FUNCTION [pf_sampleTable](datetime) AS RANGE LEFT FOR VALUES (
N'2019-12-01T00:00:00.000'
, N'2019-12-02T00:00:00.000'
, N'2019-12-03T00:00:00.000'
, N'2019-12-04T00:00:00.000'
, N'2019-12-05T00:00:00.000'
)
GO
With this technique, an update operation is needed to define new time frames every month.
I was wondering if we could use something like a wild card in the datetime fields.
CREATE PARTITION FUNCTION [pf_sampleTable](datetime) AS RANGE LEFT FOR VALUES (
N'%-01T00:00:00.000'
, N'%-02T00:00:00.000'
, N'%-03T00:00:00.000'
, N'%-04T00:00:00.000'
, N'%-05T00:00:00.000'
)
GO
CREATE PARTITION FUNCTION DDL creates static partitions. Although one can specify expressions for the partition boundaries that are evaluated when the statement is run, these are not evaluated afterwards. It is necessary to ALTER the function to create or drop partitions after creation.
Consider scheduling a daily job to execute the needed script (and perhaps remove old partitions) as desired.
I suggest a RANGE RIGHT function when partitioning on temporal types that have a time component so that values that are exactly midnight don't end up in the wrong partition. The example below will create future date partitions 2 days in advance to avoid expensive data movement when splitting partitions.
--initial 5 boundaries
CREATE PARTITION FUNCTION PF_DateTime(datetime) AS
RANGE RIGHT FOR VALUES(
NULL -- (dates outside expected range)
, N'2019-12-01T00:00:00.000'
, N'2019-12-02T00:00:00.000'
, N'2019-12-03T00:00:00.000'
, N'2019-12-04T00:00:00.000'
, N'2019-12-05T00:00:00.000'
);
CREATE PARTITION SCHEME PS_DateTime AS
PARTITION PF_DateTime ALL TO ([PRIMARY]);
--run this after midnight on 2019-12-04 to create the 2019-12-06 boundary
ALTER PARTITION SCHEME PS_DateTime
NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION PF_DateTime()
SPLIT RANGE(CAST(DATEADD(day, 2, GETDATE()) AS date));
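To automate the boundary maintenance suggested earlier, a scheduled job could run something along these lines (a sketch against the PF_DateTime/PS_DateTime objects above; the 2-day lead time is an assumption you would tune):
-- Add the boundary 2 days ahead if it does not already exist
DECLARE @next_boundary datetime = CAST(DATEADD(day, 2, GETDATE()) AS date);

IF NOT EXISTS (
    SELECT 1
    FROM sys.partition_range_values prv
    JOIN sys.partition_functions pf ON pf.function_id = prv.function_id
    WHERE pf.name = N'PF_DateTime'
    AND CAST(prv.value AS datetime) = @next_boundary
)
BEGIN
    ALTER PARTITION SCHEME PS_DateTime NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION PF_DateTime() SPLIT RANGE (@next_boundary);
END;
-- Old boundaries can be retired with MERGE RANGE once their data is archived.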
See table partitioning best practices for more information.

SQL query runs into a timeout on a sparse dataset

For sync purposes, I am trying to get a subset of the existing objects in a table.
The table has two fields, [Group] and Member, which are both stringified Guids.
All rows together may be too large to fit into a datatable; I already encountered an OutOfMemory exception. But I have to check that everything I need right now is in the datatable. So I take the Guids I want to check (they come in chunks of 1000) and query only for the related objects.
So, instead of filling my datatable once with all
SELECT * FROM Group_Membership
I am running the following SQL query against my SQL database to get related objects for one thousand Guids at a time:
SELECT *
FROM Group_Membership
WHERE
[Group] IN (@Guid0, @Guid1, @Guid2, @Guid3, @Guid4, @Guid5, ..., @Guid999)
The table in question now contains a total of 142 entries, and the query already times out (CommandTimeout = 30 seconds). On other tables, which are not as sparsely populated, similar queries don't time out.
Could someone shed some light on the logic of SQL Server and whether/how I could hint it into the right direction?
I already tried to add a nonclustered index on the column Group, but it didn't help.
I'm not sure that WHERE IN will be able to maximally use an index on [Group], or if at all. However, if you had a second table containing the GUID values, and furthermore if that column had an index, then a join might perform very fast.
Create a temporary table for the GUIDs and populate it:
CREATE TABLE #Guids (
Guid varchar(255)
);
INSERT INTO #Guids (Guid)
VALUES
(@Guid0), (@Guid1), (@Guid2), (@Guid3), (@Guid4), ...;
CREATE INDEX Idx_Guid ON #Guids (Guid);
Now try rephrasing your current query using a join instead of a WHERE IN (...):
SELECT *
FROM Group_Membership t1
INNER JOIN #Guids t2
ON t1.[Group] = t2.Guid;
As a disclaimer, if this doesn't improve the performance, it could be because your table has low cardinality. In such a case, an index might not be very effective.

SQL Server nonclustered indexes

I am trying to figure out the best way to handle the indexes on a table in SQL Server.
I have a table that only needs to be read from. No real writing to the table (after the initial setup).
I have about 5-6 columns in the table that need to be indexed. Does it make more sense to set up one nonclustered index for the entire table and add all the columns that I need indexed to that index, or should I set up multiple nonclustered indexes, each with one column?
I am wondering which setup would have better read performance.
Any help on this would be great.
UPDATE:
There are some good answers already but I wanted to elaborate on my needs a little more.
There is one main table with auto records. I need to be able to perform very quick counts on over 100MM records. The WHERE clauses will vary, but I am trying to index all of the possible columns that can appear in them. So I will have queries like:
SELECT COUNT(recordID)
FROM tableName
WHERE zip IN (32801, 32802, 32803, 32809)
AND makeID = '32'
AND modelID IN (22, 332, 402, 504, 620)
or something like this:
SELECT COUNT(recordID)
FROM tableName
WHERE stateID = '9'
AND classCode IN (3,5,9)
AND makeID NOT IN (55, 56, 60, 80, 99)
So there are about 5-6 columns that could be in the WHERE clause, but which ones are used will vary a lot.
The fewer indexes you have - the better. Each index might speed up some queries - but it also incurs overhead and needs to be maintained. Not so bad if you don't write much to the table.
If you can combine multiple columns into a single index - perfect! But if you have a compound index on multiple columns, that index can only be used if you use/need the n left-most columns.
So if you have an index on (City, LastName, FirstName) like in a phone book - this works if you're looking for:
everyone in a given city
every "Smith" in "Boston"
every "Paul Smith" in "New York"
but it cannot be used to find all entries with first name "Paul" or all people with lastname of "Brown" in your entire table; the index can only be used if you also specify the City column
So therefore - compound indexes are beneficial and desirable - but only if you can really use them! Having just one index with all 6 of your columns does not help you at all if you need to query the columns individually (see the sketch below).
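As an illustration (a sketch with a hypothetical PhoneBook table), the compound index and the queries that can and cannot seek on it would look like this:
CREATE INDEX ix_phonebook_city_last_first
ON dbo.PhoneBook (City, LastName, FirstName);

-- Can seek: the left-most index columns are specified
SELECT * FROM dbo.PhoneBook WHERE City = 'Boston';
SELECT * FROM dbo.PhoneBook WHERE City = 'Boston' AND LastName = 'Smith';
SELECT * FROM dbo.PhoneBook WHERE City = 'New York' AND LastName = 'Smith' AND FirstName = 'Paul';

-- Cannot seek: City is missing, so at best the whole index is scanned
SELECT * FROM dbo.PhoneBook WHERE FirstName = 'Paul';
SELECT * FROM dbo.PhoneBook WHERE LastName = 'Brown';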
Update: with your concrete queries, you can now start to design what indexes would help:
SELECT COUNT(recordID)
FROM tableName
WHERE zip IN (32801, 32802, 32803, 32809)
AND makeID = '32'
AND modelID IN (22, 332, 402, 504, 620)
Here, an index on (zip, makeID, modelID) would probably be a good idea - all three columns are used in the WHERE clause (together), and having the recordID in the index as well (as an INCLUDE(recordID) clause) should help, too.
SELECT COUNT(recordID)
FROM tableName
WHERE stateID = '9'
AND classCode IN (3,5,9)
AND makeID NOT IN (55, 56, 60, 80, 99)
Again: based on the WHERE clause - create an index on (stateID, classCode, makeID) and possibly add INCLUDE(recordID) so that the nonclustered index becomes covering (i.e. all the info needed for your query is in the nonclustered index itself - no need to go back to the "base" table).
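In DDL, the two suggested indexes might look like this (a sketch; the index names are illustrative):
-- Covers: WHERE zip IN (...) AND makeID = ... AND modelID IN (...)
CREATE NONCLUSTERED INDEX ix_tableName_zip_makeID_modelID
ON dbo.tableName (zip, makeID, modelID)
INCLUDE (recordID);

-- Covers: WHERE stateID = ... AND classCode IN (...) AND makeID NOT IN (...)
CREATE NONCLUSTERED INDEX ix_tableName_stateID_classCode_makeID
ON dbo.tableName (stateID, classCode, makeID)
INCLUDE (recordID);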
It depends on your access pattern.
For a read-only table, I'd most likely create multiple nonclustered indexes, each having multiple key columns to match the WHERE clauses, and INCLUDEd columns for non-key columns.
I would have neither one nonclustered index covering everything nor one per column: they won't be useful for your actual queries.

SQL Server index - very large table with where clause against a very small range of values - do I need an index for the where clause?

I am designing a database with a single table for a special scenario I need to implement a solution for. The table will have several hundred million rows after a short time, but each row will be fairly compact. Even when there are a lot of rows, I need insert, update and select speeds to be nice and fast, so I need to choose the best indexes for the job.
My table looks like this:
create table dbo.Domain
(
Name varchar(255) not null,
MetricType smallint not null, -- very small range of values, maybe 10-20 at most
Priority smallint not null, -- extremely small range of values, generally 1-4
DateToProcess datetime not null,
DateProcessed datetime null,
primary key(Name, MetricType)
);
A select query will look like this:
select Name from Domain
where MetricType = @metricType
and DateProcessed is null
and DateToProcess < GETUTCDATE()
order by Priority desc, DateToProcess asc
The first type of update will look like this:
merge into Domain as target
using @myTablePrm as source
on source.Name = target.Name
and source.MetricType = target.MetricType
when matched then
update set
DateToProcess = source.DateToProcess,
Priority = source.Priority,
DateProcessed = case -- set to null if DateToProcess is in the future
when DateToProcess < DateProcessed then DateProcessed
else null end
when not matched then
insert (Name, MetricType, Priority, DateToProcess)
values (source.Name, source.MetricType, source.Priority, source.DateToProcess);
The second type of update will look like this:
update Domain
set DateProcessed = source.DateProcessed
from @myTablePrm source
where Name = source.Name and MetricType = @metricType
Are these the best indexes for optimal insert, update and select speed?
-- for the order by clause in the select query
create index IX_Domain_PriorityQueue
on Domain(Priority desc, DateToProcess asc)
where DateProcessed is null;
-- for the where clause in the select query
create index IX_Domain_MetricType
on Domain(MetricType asc);
Observations:
Your updates should use the PK
Why not use tinyint (range 0-255) to make the rows even narrower?
Do you need datetime? Can you use smalldatetime?
Ideas:
Your SELECT query doesn't have an index to cover it. You need one on (DateToProcess, MetricType, Priority DESC) INCLUDE (Name) WHERE DateProcessed IS NULL
You'll have to experiment with key column order to get the best one.
You could extend that index to have a filtered index per MetricType too (keeping the DateProcessed IS NULL filter). I'd do this after the other one, once you have millions of rows to test with (see the sketch below).
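A sketch of the suggested indexes (the key column order shown follows the suggestion above and may need experimentation; index names are illustrative):
-- Filtered, covering index for the queue-style SELECT
CREATE NONCLUSTERED INDEX IX_Domain_Select
ON dbo.Domain (DateToProcess, MetricType, Priority DESC)
INCLUDE (Name)
WHERE DateProcessed IS NULL;

-- Optional variant filtered per MetricType (one per frequently queried value)
CREATE NONCLUSTERED INDEX IX_Domain_Select_Metric1
ON dbo.Domain (DateToProcess, Priority DESC)
INCLUDE (Name)
WHERE DateProcessed IS NULL AND MetricType = 1;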
I suspect that your best performance will come from having no indexes on Priority and MetricType. The cardinality is likely too low for the indexes to do much good.
An index on DateToProcess will almost certainly help, as there is likely to be high cardinality in that column and it is used in a WHERE and ORDER BY clause. I would start with that first.
Whether an index on DateProcessed will help is up for debate. That depends on what percentage of NULL values you expect for this column. Your best bet, as usual, is to examine the query plan with some real data.
In the table schema section you have highlighted that MetricType is one of the two primary key columns, so it should definitely be indexed along with the Name column. As for the Priority and DateToProcess fields, since these will be present in a WHERE clause it can't hurt to have them indexed as well. However, I don't recommend the WHERE DateProcessed IS NULL clause you have on that index; indexing just a subset of the data is not a good idea - remove the filter and index the whole of both those columns.
