Altering/Editing an already partitioned table - sql-server

What steps to take to add additional partitions to the end of an already partitioned table in SQL Server?
Conditions:
The Partition Function is Right Range.
The table is considered a VLTB (very large table).
No DB downtime is acceptable (<10min).
Also, How to verify the partitions and rows are correctly mapped?

Addressing your questions in turn:
What steps to take to add additional partitions to the end of an already partitioned table in SQL Server?
Partitioned tables are built on partition schemes which themselves are built on partition functions. Partition functions explicitly specify partition boundaries which implicitly define the partitions. To add a new partition to the table, you need to alter the partition function to add a new partition boundary. The syntax for that is alter partition function... split. For example, let's say that you have an existing partition function on a datetime data type that defines monthly partitions.
CREATE PARTITION FUNCTION PF_Monthly(datetime)
AS RANGE RIGHT FOR VALUES (
'2022-10-01',
'2022-11-01',
'2022-12-01',
'2023-01-01'
);
Pausing there to talk about the last two partitions in the current setup: the next-to-last partition is defined as 2022-12-01 <= x < 2023-01-01, while the last partition is defined as 2023-01-01 <= x. Which is to say that the next-to-last partition is bounded to the month of December 2022, while the last partition is unbounded on the high side; it includes data for January 2023 but also anything larger.
If you want to bound the last partition to just January 2023, you'll add a partition boundary to the function for the high side of that partition. There's a small catch in that you'll also need to alter the partition scheme to tell SQL where to put data, but that's a small thing.
ALTER PARTITION SCHEME PS_Monthly
NEXT USED someFileGroup;
ALTER PARTITION FUNCTION PF_Monthly()
SPLIT RANGE ('2023-02-01');
At this point, what used to be your highest partition is now defined as 2023-01-01 <= x < 2023-02-01 and the highest partition is defined as 2023-02-01 <= x. I should note that adding a boundary to a partition function will affect all tables that use it. When I was using table partitioning at a previous job, I had a rule to have only one table using a given partition function (even if they were logically equivalent).
No DB downtime is acceptable (<10min)
The above exposition doesn't mention one important point: if there is data on either side of the new boundary, a new B-tree is going to be built for it (which is a size-of-data operation). There's more on that in the documentation. To keep that to a minimum, I like to keep two empty partitions at the end of the scheme. Using my example above, that would mean that I'd have added the January partition boundary in November. By doing it this way, you have some leeway in when the actual partition split happens (i.e. if it's a bit late, you're not accidentally incurring data movement). I'd also put in monitoring along the lines of "if the highest partition boundary is less than 45 days away, alert". A slightly more sophisticated but more correct alert would be "if there is data in the second-to-last partition, send an alert".
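A minimal sketch of that "boundary is getting close" check, assuming the PF_Monthly function from the example above (adjust the function name, data type cast, and 45-day threshold to your setup):

```sql
-- Find the highest boundary value of the partition function; if it is
-- less than 45 days away, return a row (which your monitoring treats
-- as an alert condition).
SELECT MAX(CAST(prv.value AS datetime)) AS highest_boundary
FROM sys.partition_functions pf
JOIN sys.partition_range_values prv
    ON prv.function_id = pf.function_id
WHERE pf.name = 'PF_Monthly'
HAVING MAX(CAST(prv.value AS datetime)) < DATEADD(DAY, 45, GETDATE());
```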
Also, How to verify the partitions and rows are correctly mapped?
You can query the DMVs for this. I like using the script in this blog post. There's also the $PARTITION() function if you want to see which partition specific rows in your table belong to.
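As one example of what such a DMV query looks like (the table name here is a placeholder; for RANGE RIGHT functions the boundary with the same number as the partition is its exclusive upper bound):

```sql
-- Row count and upper boundary per partition of a given table.
SELECT p.partition_number,
       p.rows,
       prv.value AS upper_boundary   -- NULL for the unbounded last partition
FROM sys.partitions p
JOIN sys.indexes i
    ON i.object_id = p.object_id AND i.index_id = p.index_id
LEFT JOIN sys.partition_schemes ps
    ON ps.data_space_id = i.data_space_id
LEFT JOIN sys.partition_range_values prv
    ON prv.function_id = ps.function_id
   AND prv.boundary_id = p.partition_number
WHERE p.object_id = OBJECT_ID('dbo.MyPartitionedTable')
  AND i.index_id IN (0, 1)           -- heap or clustered index only
ORDER BY p.partition_number;

-- $PARTITION maps an individual value to its partition number:
SELECT $PARTITION.PF_Monthly('2023-01-15') AS partition_number;
```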

Related

Partition scanning increases if functions like UPPER are used

Hi, I noticed that partition scanning increases if we use functions like UPPER in the WHERE clause. Although it is not required to use UPPER here, I wanted to know why it changes the behavior of partition scanning. It seems that if functions are used, many more partitions get scanned.
SELECT *
FROM SCOPS_DB.TABLE1
WHERE YEAR = 2015 AND UPPER(COL1) = 'COLVAL';
SCANS 18,759 PARTITIONS
SELECT *
FROM SCOPS_DB.TABLE1
WHERE YEAR = 2015 AND COL1 = 'COLVAL';
SCANS 1 PARTITION
Thanks
Rajib
Snowflake stores metadata about each column in a metadata services layer, including "the range of values for each of the columns in the micro-partition". To me, this is like having an index on each column pointing to the corresponding micro-partitions in your table.
In your example, the range of values stored in the metadata layer for COL1 is stored in its raw form (probably a mix of upper and lower case). This means that when you apply a function to the column, the metadata services layer cannot be used as an "index" to fetch the micro-partitions for the range of rows you need.
If you need to, you can create a clustering key on the column that applies the function. In that case it will be used when you apply the same function as a predicate.
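In Snowflake syntax, a clustering key on an expression looks roughly like this (table and column names taken from the question; whether this is worth the reclustering cost depends on your workload):

```sql
-- Cluster the table on the same expression the queries filter on,
-- so micro-partition pruning can work for UPPER(COL1) predicates.
ALTER TABLE SCOPS_DB.TABLE1 CLUSTER BY (YEAR, UPPER(COL1));

-- A query using the same expression as a predicate can then prune:
SELECT *
FROM SCOPS_DB.TABLE1
WHERE YEAR = 2015 AND UPPER(COL1) = 'COLVAL';
```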

Extend partition function and scheme in SQL Server

My SQL Server database includes some tables partitioned by month. The partition scheme and function are set to the 20191201 right limit. My partition scheme uses separate file groups for each partition. I now need to extend these before the end of the year (last partition key on the right is N'20191231' and last file group FG_2_201912).
Question #1: do I need to repeat ALTER PARTITION SCHEME [PartitionByPeriodScheme] NEXT USED [FG_2_202001]; for each file group until [FG_2_202012]? I sure can write a script which will produce the command dynamically but is there any way to add all file groups with one command?
Question #2: do I need to repeat ALTER PARTITION FUNCTION [PartitionByPeriodFunction]() SPLIT RANGE (N'20200131') for each partition key value until 20201231? Do I really need to split the range, since there is no data in the last right partition yet? Are there any alternatives?
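The dynamic script the question mentions can be sketched along these lines, assuming the scheme, function, and filegroup naming from the question (there is one NEXT USED plus one SPLIT per month; verify the boundary string format matches your partition function's data type before running anything like this):

```sql
-- Generate and execute NEXT USED + SPLIT RANGE pairs for each month of 2020.
DECLARE @month int = 1,
        @fg sysname,
        @boundary nvarchar(8),
        @sql nvarchar(max);

WHILE @month <= 12
BEGIN
    -- e.g. FG_2_202001 and boundary 20200131 (last day of the month)
    SET @fg = CONCAT('FG_2_2020', FORMAT(@month, '00'));
    SET @boundary = CONCAT('2020', FORMAT(@month, '00'),
        FORMAT(DAY(EOMONTH(DATEFROMPARTS(2020, @month, 1))), '00'));

    SET @sql = CONCAT(
        N'ALTER PARTITION SCHEME [PartitionByPeriodScheme] NEXT USED [', @fg, N']; ',
        N'ALTER PARTITION FUNCTION [PartitionByPeriodFunction]() ',
        N'SPLIT RANGE (N''', @boundary, N''');');
    EXEC sys.sp_executesql @sql;

    SET @month += 1;
END;
```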

SQL Server Table Partitioning Cannot drop Filegroup after Partition Switch

I have a huge table with around 110 partitions. I wish to archive the oldest partition and drop the FileGroup. Following is the strategy I adopted.
Created an exact empty table tablename_archive and met all partitioning requirements.
Perform Partition switch
ALTER TABLE tablename SWITCH PARTITION 1 TO tablename_archive PARTITION 1
After verifying the switch (partition swap), I dropped the archived table.
Merged the Partition function using the first boundary value as follows
ALTER PARTITION FUNCTION YMDatePF2 () MERGE RANGE ('2012-01-01 00:00:00.000')
Although there is no data now on the FG, when I try to drop the file or FG it errors out saying:
The file 'XXXXXXXX' cannot be removed because it is not empty.
The filegroup 'XXXXXXXX' cannot be removed because it is not empty.
Is there any change I need to make to the partition scheme too, after merging the function?
Please let me know if you need any more details.
You can never remove the first (or only) partition from a RANGE RIGHT partition function (or conversely, the last (or only) partition of a RANGE LEFT function). The first (or last if RANGE LEFT) filegroup from the underlying partition schemes can never be removed from the schemes either. Remember you have one more partition, and partition scheme filegroup mapping, than partition boundaries.
If your intent was to archive January 2012 data, you should have switched partition 2 rather than 1 because the first partition contained data less than '2012-01-01 00:00:00.000'. Now that the second partition has been merged, the first partition (and the first filegroup) contains data less than '2012-02-01T00:00:00.000', which includes January 2012 data.
With a RANGE RIGHT sliding window, it is best to plan to keep the first filegroup empty. You could use the PRIMARY filegroup or a dummy one with no files for that purpose. See Table Partitioning Best Practices.
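Under that plan, the monthly archive cycle from the question would have looked roughly like this (names taken from the question; partition 1 stays permanently empty):

```sql
-- Archive January 2012: with RANGE RIGHT, partition 2 holds
-- [2012-01-01, 2012-02-01), so that is the one to switch out.
ALTER TABLE tablename SWITCH PARTITION 2 TO tablename_archive PARTITION 2;

-- Both partitions on either side of the boundary are now empty, so the
-- merge involves no data movement and partition 1 remains empty.
ALTER PARTITION FUNCTION YMDatePF2() MERGE RANGE ('2012-01-01 00:00:00.000');
```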

Should I normalize a database with a column for each day of the week?

Designing an oracle database for an ordering system. Each row will be a schedule that stores can be assigned that designates if/when they will order from a specific vendor for each day of the week.
It will be keyed by vendor id and a unique schedule id. Started out with those columns, and then a column for each day of the week like TIME_SUN, TIME_MON, TIME_TUE... to contain the order time for each day.
I'm normally inclined to try and normalize data and have another table referencing the schedule id, with a column like DAY_OF_WEEK and ORDER_TIME, so potentially 7 rows for the same data.
Is it really necessary for me to do this, or is it just over complicating what can be handled as a simple single row?
Normalization is the best way. Reasons:
The table will act as a master table
The table can be used for reference in future needs
It will be costly to normalize later
With a huge number of rows, repeated column values cause unwanted database size growth
Using a master table limits the redundant data to just the foreign key
Normalization would be advisable. In the future, if you need to store two or more order times for the same day, you'll only have to add rows to your vendor_day_order table. With the first approach you would have to modify the table structure.
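A minimal sketch of the normalized design in Oracle DDL (table and column names are hypothetical, following the vendor_day_order naming used above; the key allows more than one order time per day):

```sql
-- One row per schedule, keyed by schedule_id.
CREATE TABLE vendor_schedule (
    schedule_id  NUMBER PRIMARY KEY,
    vendor_id    NUMBER NOT NULL
);

-- One row per schedule/day/time instead of seven TIME_xxx columns.
CREATE TABLE vendor_day_order (
    schedule_id  NUMBER NOT NULL REFERENCES vendor_schedule (schedule_id),
    day_of_week  NUMBER(1) NOT NULL CHECK (day_of_week BETWEEN 1 AND 7),
    order_time   DATE NOT NULL,
    PRIMARY KEY (schedule_id, day_of_week, order_time)
);
```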

Approaches to table partitioning in SQL Server

The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect statistics of usage (queries run), look at the engine own statistics like sys.dm_db_index_usage_stats and then you make an informed decision: the partition that bests balances data size and gives best affinity for the most often run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index (where 'table' = one of the indexes), not per table, so the question is not what to partition on, but which indexes to partition or not and what partitioning function to use. Your clustered indexes on the two tables are going to be the most likely candidates obviously (not much sense to partition just a non-clustered index and not partition the clustered one) so, unless you're considering redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I were to venture a guess, I'd say that for any data that accumulates over time (like 'cases' with a 'year') the most natural partition is the sliding window.
If you have no other choice, you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10, placing each record in the corresponding table (depending on how your UniqueIdentifier is generated, you might need to manage ID allocation manually).
When performing a query, you will need to run the same query on all tables and use UNION to merge the result sets into a single query result.
It's not as good as partitioning the tables based on some logical separation that corresponds to the expected queries, but it's better than hitting the size limit of a table.
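Concretely, the modulo scheme sketched above looks something like this (table and column names are hypothetical, based on the Case tables in the question):

```sql
-- Route rows into a shard by PrimaryKey modulo 10,
-- e.g. loading shard 3 from a staging table.
INSERT INTO Case03 (PrimaryKey, Year, Type, Status)
SELECT PrimaryKey, Year, Type, Status
FROM staging_case
WHERE PrimaryKey % 10 = 3;

-- Querying requires running the same predicate on every shard
-- and merging the results.
SELECT * FROM Case00 WHERE Year = 2015
UNION ALL
SELECT * FROM Case01 WHERE Year = 2015
-- ... repeat for Case02 through Case08 ...
UNION ALL
SELECT * FROM Case09 WHERE Year = 2015;
```

Since the shards are disjoint by construction, UNION ALL avoids the duplicate-elimination cost of a plain UNION.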
Another possible thing to look at (before partitioning) is your model.
Are you in a normalized database? Are there further steps which could improve performance by different choices in the normalization/de-/partial-normalization? Are there options to transform the data into a Kimball-style dimensional star model which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.