I want to create a range-partitioned table in an Azure SQL Database with rolling monthly partitions, so that everything from January (no matter the year) lands in one partition.
The table contains logging information for ETL processes, and to ease housekeeping I'd like to be able to truncate partitions from time to time.
In Oracle I would do it like this:
CREATE TABLE my_log (
log_id NUMBER PRIMARY KEY,
log_txt VARCHAR2(1000),
insert_date DATE
)
PARTITION BY RANGE(TO_CHAR(insert_date, 'MM')) (
partition m1 values less than ('02'),
partition m2 values less than ('03'),
partition m3 values less than ('04'),
partition m4 values less than ('05'),
partition m5 values less than ('06'),
partition m6 values less than ('07'),
partition m7 values less than ('08'),
partition m8 values less than ('09'),
partition m9 values less than ('10'),
partition m10 values less than ('11'),
partition m11 values less than ('12'),
partition m12 values less than ('13'),
partition mmax values less than (MAXVALUE)
);
And use ALTER TABLE ... TRUNCATE PARTITION for housekeeping to get rid of everything older than, let's say, 4 months.
What I found out so far: if I create a partition function for the ranges, the column that contains the range value must be part of the primary key. Is there any way to circumvent that?
This does not work:
CREATE PARTITION FUNCTION logRangePF1 (int)
AS RANGE RIGHT FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12) ;
GO
CREATE PARTITION SCHEME logRangePS1
AS PARTITION logRangePF1
ALL TO ('PRIMARY') ;
GO
CREATE TABLE dbo.logPartitionTable (
log_id INT PRIMARY KEY ,
log_text nvarchar(1000),
insert_date datetime,
partition_column as datepart(month, insert_date) PERSISTED
)
ON logRangePS1 ( partition_column ) ;
GO
I appreciate any hint on how to achieve this in an Azure SQL Database.
Thanks
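One way around the restriction is to include the persisted computed column in the primary key itself, so that the partitioning column is part of every unique index on the table. A sketch against the partition function and scheme from the question (constraint name is an assumption):

```sql
-- Sketch: include the persisted computed month column in the primary key so
-- the table can be created on the partition scheme (and partitions truncated).
CREATE TABLE dbo.logPartitionTable (
    log_id INT NOT NULL,
    log_text nvarchar(1000),
    insert_date datetime,
    partition_column AS DATEPART(month, insert_date) PERSISTED,
    CONSTRAINT PK_logPartitionTable
        PRIMARY KEY CLUSTERED (log_id, partition_column)
)
ON logRangePS1 (partition_column);
```

The trade-off: uniqueness is now enforced on the pair (log_id, partition_column) rather than on log_id alone. A non-aligned unique index on log_id alone would restore that guarantee, but it blocks partition-level truncation and switching, so the composite key is the usual compromise.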
I want to partition on the first five days of the month.
The following is how I achieved it:
CREATE PARTITION FUNCTION [pf_sampleTable](datetime) AS RANGE LEFT FOR VALUES (
N'2019-12-01T00:00:00.000'
, N'2019-12-02T00:00:00.000'
, N'2019-12-03T00:00:00.000'
, N'2019-12-04T00:00:00.000'
, N'2019-12-05T00:00:00.000'
)
GO
With this technique, an update operation is needed every month to define the new time frames.
I was wondering if we could use something like a wildcard in the datetime fields:
CREATE PARTITION FUNCTION [pf_sampleTable](datetime) AS RANGE LEFT FOR VALUES (
N'%-01T00:00:00.000'
, N'%-02T00:00:00.000'
, N'%-03T00:00:00.000'
, N'%-04T00:00:00.000'
, N'%-05T00:00:00.000'
)
GO
CREATE PARTITION FUNCTION DDL creates static partitions. Although one can specify expressions for the partition boundaries that are evaluated when the statement is run, these are not evaluated afterwards. It is necessary to ALTER the function to create or drop partitions after creation.
Consider scheduling a daily job to execute the needed script (and perhaps remove old partitions) as desired.
I suggest a RANGE RIGHT function when partitioning on temporal types that have a time component so that values that are exactly midnight don't end up in the wrong partition. The example below will create future date partitions 2 days in advance to avoid expensive data movement when splitting partitions.
--initial 5 boundaries
CREATE PARTITION FUNCTION PF_DateTime(datetime) AS
RANGE RIGHT FOR VALUES(
NULL -- (dates outside expected range)
, N'2019-12-01T00:00:00.000'
, N'2019-12-02T00:00:00.000'
, N'2019-12-03T00:00:00.000'
, N'2019-12-04T00:00:00.000'
, N'2019-12-05T00:00:00.000'
);
CREATE PARTITION SCHEME PS_DateTime AS
PARTITION PF_DateTime ALL TO ([PRIMARY]);
--run this after midnight on 2019-12-04 to create the 2019-12-06 boundary
ALTER PARTITION SCHEME PS_DateTime
NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION PF_DateTime()
SPLIT RANGE(CAST(DATEADD(day, 2, GETDATE()) AS date));
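For the "remove old partitions" side of the job, a hedged sketch (table name is hypothetical; TRUNCATE TABLE ... WITH PARTITIONS requires SQL Server 2016+ or Azure SQL Database):

```sql
-- Hypothetical housekeeping sketch: empty an old partition, then merge its
-- boundary away so the partition count does not grow without bound.

-- Look up which partition number holds the old boundary value:
SELECT $PARTITION.PF_DateTime(N'2019-12-01T00:00:00.000') AS partition_number;

-- PARTITIONS takes literal numbers; suppose the lookup above returned 3:
TRUNCATE TABLE dbo.MyPartitionedTable WITH (PARTITIONS (3));

-- Remove the now-empty boundary:
ALTER PARTITION FUNCTION PF_DateTime()
    MERGE RANGE (N'2019-12-01T00:00:00.000');
```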
See table partitioning best practices for more information.
I wonder why Oracle databases require that at least a single partition is defined when creating a table with PARTITION BY RANGE ... INTERVAL.
This is correct:
CREATE TABLE FOO (
bar VARCHAR2(10),
creation_date timestamp(6) not null
)
PARTITION BY RANGE (creation_date) INTERVAL (NUMTODSINTERVAL(1,'DAY')) (
PARTITION part_01 values LESS THAN (TO_DATE('01-03-2018','DD-MM-YYYY'))
)
This, however, does not:
CREATE TABLE FOO (
bar VARCHAR2(10),
creation_date timestamp(6) not null
)
PARTITION BY RANGE (creation_date) INTERVAL (NUMTODSINTERVAL(1,'DAY'))
I would expect the first partition to be required in some migration case, but not when creating a new table.
Oracle documentation about that:
The INTERVAL clause of the CREATE TABLE statement establishes interval partitioning for the table. You must specify at least one range partition using the PARTITION clause.
https://docs.oracle.com/cd/E11882_01/server.112/e25523/part_admin001.htm#BAJHFFBE
Without an initial boundary partition, Oracle does not know where to start the intervals. For daily partitions it is not so obvious, but imagine you have one partition per week, i.e. 7 days.
Should it run Monday-to-Monday, Sunday-to-Sunday, or something else?
And what does an interval of "1 DAY" mean? From 00:00:00 to 23:59:59 (as implicitly given in your example), or something else, for example 12:00:00 to 11:59:59 (which would be PARTITION part_01 VALUES LESS THAN (TO_DATE('01-03-2018 12:00','DD-MM-YYYY HH24:MI')))?
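To make the anchor's role concrete, a hypothetical weekly example: every later boundary is derived from the single anchor partition, so the anchor decides which weekday each week starts on.

```sql
-- Hypothetical sketch: 05-03-2018 was a Monday, so every interval partition
-- derived from this anchor covers Monday 00:00 up to the next Monday 00:00.
CREATE TABLE foo_weekly (
    bar           VARCHAR2(10),
    creation_date TIMESTAMP(6) NOT NULL
)
PARTITION BY RANGE (creation_date) INTERVAL (NUMTODSINTERVAL(7,'DAY')) (
    PARTITION part_01 VALUES LESS THAN (TO_DATE('05-03-2018','DD-MM-YYYY'))
);
```

Moving the anchor to a Sunday would shift every automatically created week accordingly, which is exactly why Oracle cannot pick the anchor for you.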
I am pretty new to the table partitioning technique supported by MS SQL Server. I have a huge table with more than 40 million records and want to apply table partitioning to it. Most of the examples I find define the partition function as RANGE LEFT|RIGHT FOR VALUES(...), but what I need is something like the following example I found on an Oracle web page:
CREATE TABLE q1_sales_by_region
(...,
...,
...,
state varchar2(2))
PARTITION BY LIST (state)
(PARTITION q1_northwest VALUES ('OR', 'WA'),
PARTITION q1_southwest VALUES ('AZ', 'UT', 'NM'),
PARTITION q1_northeast VALUES ('NY', 'VM', 'NJ'),
PARTITION q1_southeast VALUES ('FL', 'GA'),
PARTITION q1_northcentral VALUES ('SD', 'WI'),
PARTITION q1_southcentral VALUES ('OK', 'TX'));
The example shows that we can specify a PARTITION BY LIST clause in the CREATE TABLE statement, and the PARTITION clauses specify lists of discrete values that qualify rows to be included in the partition.
My question is does MS SQL server support table partitioning by List as well?
It does not. SQL Server's partitioned tables only support range partitioning.
In this circumstance, you may wish instead to consider using a Partitioned View.
There are a number of restrictions (scroll down slightly from the link anchor) that apply to partitioned views, but the key here is that the partitioning is based on CHECK constraints in the underlying tables, and one form the CHECK can take is <col> IN (value_list).
However, setting up partitioned views is considerably more "manual" than creating a partitioned table: each table that holds part of the view's data has to be created individually and explicitly.
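A minimal sketch of the idea, with hypothetical table and column names; the list "partitioning" lives in the CHECK ... IN (...) constraint on each member table, and the partitioning column is part of each primary key as the partitioned-view rules require:

```sql
-- Two member tables, one per region, each constrained to its own state list.
CREATE TABLE dbo.sales_q1_northwest (
    sale_id int NOT NULL,
    state   char(2) NOT NULL CHECK (state IN ('OR', 'WA')),
    CONSTRAINT PK_sales_q1_northwest PRIMARY KEY (sale_id, state)
);
CREATE TABLE dbo.sales_q1_southwest (
    sale_id int NOT NULL,
    state   char(2) NOT NULL CHECK (state IN ('AZ', 'UT', 'NM')),
    CONSTRAINT PK_sales_q1_southwest PRIMARY KEY (sale_id, state)
);
GO
-- The partitioned view unions the members; queries filtering on state can
-- eliminate member tables via the CHECK constraints.
CREATE VIEW dbo.q1_sales_by_region
AS
SELECT sale_id, state FROM dbo.sales_q1_northwest
UNION ALL
SELECT sale_id, state FROM dbo.sales_q1_southwest;
```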
You can achieve this by using an auxiliary persisted computed column.
Here you can find a complete example:
LIST Partitioning in SQL Server
The idea is to create a computed column based on your list like this:
alter table q1_sales_by_region add calc_field as (case when state in ('OR', 'WA') then 1 ... end) PERSISTED
And then partition on this calc_field using a standard range partition function.
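Put together, a sketch with hypothetical names, mapping each state list to a region number in a persisted computed column and partitioning on that with an ordinary range function:

```sql
-- One partition per region number (RANGE RIGHT, int boundaries).
CREATE PARTITION FUNCTION PF_Region (int)
    AS RANGE RIGHT FOR VALUES (1, 2, 3, 4, 5, 6);
CREATE PARTITION SCHEME PS_Region
    AS PARTITION PF_Region ALL TO ([PRIMARY]);
GO
CREATE TABLE dbo.q1_sales_by_region (
    state  char(2) NOT NULL,
    -- Persisted computed column translating the discrete lists into a range key:
    region AS (CASE
                   WHEN state IN ('OR', 'WA')       THEN 1
                   WHEN state IN ('AZ', 'UT', 'NM') THEN 2
                   WHEN state IN ('NY', 'VM', 'NJ') THEN 3
                   WHEN state IN ('FL', 'GA')       THEN 4
                   WHEN state IN ('SD', 'WI')       THEN 5
                   WHEN state IN ('OK', 'TX')       THEN 6
               END) PERSISTED
)
ON PS_Region (region);
```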
What are you trying to accomplish with partitioning? 40M rows was huge 20 years ago but commonplace nowadays. Index and query tuning is especially important for performance of large tables, although partitioning can improve performance of large scans when the partitioning column is not the leftmost clustered index key column and partitions can be eliminated during query processing.
For improved manageability and control over physical placement on different filegroups, you can use range partitioning with a filegroup per region. For example:
CREATE TABLE q1_sales_by_region
(
--
state char(2)
);
CREATE PARTITION FUNCTION PF_State(char(2)) AS RANGE RIGHT FOR VALUES(
'AZ'
, 'FL'
, 'GA'
, 'NJ'
, 'NM'
, 'NY'
, 'OK'
, 'OR'
, 'SD'
, 'TX'
, 'UT'
, 'VM'
, 'WA'
, 'WI'
);
CREATE PARTITION SCHEME PS_State AS PARTITION PF_State TO(
[PRIMARY] --unused
, q1_southwest --'AZ'
, q1_southeast --'FL'
, q1_southeast --'GA'
, q1_northeast --'NJ'
, q1_southwest --'NM'
, q1_northeast --'NY'
, q1_southcentral --'OK'
, q1_northwest --'OR'
, q1_northcentral --'SD'
, q1_southcentral --'TX'
, q1_southwest --'UT'
, q1_northeast --'VM'
, q1_northwest --'WA'
, q1_northcentral --'WI'
);
You can also add a check constraint if you don't already have a related table to enforce only valid state values:
ALTER TABLE q1_sales_by_region
ADD CONSTRAINT ck_q1_sales_by_region_state
CHECK (state IN('OR', 'WA', 'AZ', 'UT', 'NM','NY', 'VM', 'NJ','FL', 'GA','SD', 'WI','OK', 'TX'));
The schema I'm working on has a small number of customers, with lots of data per customer.
In determining a partitioning strategy, my first thought was to partition by customer_id and then subpartition by range with a day interval. However, you cannot use INTERVAL in subpartitions.
Ultimately I would like a way to automatically create partitions for new customers as they are created, and also have automatic daily subpartitions created for the customers' data. All application queries are at the customer_id level with various date ranges specified.
This post is nearly identical, but the answer involves reversing the partitioning strategy, and I would still like to find a way to accomplish range-range interval partitioning. One way could potentially be to have a monthly database job to create subpartitions for the days/months ahead, but that doesn't feel right.
Perhaps I'm wrong on my assumptions that the current data structure would benefit more from a range-range interval partitioning strategy. We have a few customers whose data dwarfs other customers, so I was thinking of ways to isolate customer data.
Any thoughts/suggestions on a better approach?
Thank you again!
UPDATE
Here is an example of what I was proposing:
CREATE TABLE PART_TEST(
CUSTOMER_ID NUMBER,
LAST_MODIFIED_DATE DATE
)
PARTITION BY RANGE (CUSTOMER_ID)
INTERVAL (1)
SUBPARTITION BY RANGE (LAST_MODIFIED_DATE)
SUBPARTITION TEMPLATE
(
SUBPARTITION subpart_1206_min values LESS THAN (TO_DATE('12/2006','MM/YYYY')),
SUBPARTITION subpart_0107 values LESS THAN (TO_DATE('01/2007','MM/YYYY')),
SUBPARTITION subpart_0207 values LESS THAN (TO_DATE('02/2007','MM/YYYY')),
...
...
...
SUBPARTITION subpart_max values LESS THAN (MAXVALUE)
)
(
PARTITION part_1 VALUES LESS THAN (1)
)
I currently have 290 subpartitions in the template. This appears to be working, except for one snag: in my tests I'm finding that any record with a CUSTOMER_ID greater than 3615 fails with ORA-14400: inserted partition key does not map to any partition.
You can make a RANGE INTERVAL partition on date and then LIST or RANGE subpartition on it. Would be like this:
CREATE TABLE MY_PART_TABLE
(
CUSTOMER_ID NUMBER NOT NULL,
THE_DATE TIMESTAMP(0) NOT NULL,
OTHER_COLUMNS NUMBER
)
PARTITION BY RANGE (THE_DATE) INTERVAL (INTERVAL '1' MONTH)
SUBPARTITION BY RANGE (CUSTOMER_ID)
SUBPARTITION TEMPLATE (
SUBPARTITION CUSTOMER_GROUP_1 VALUES LESS THAN (10),
SUBPARTITION CUSTOMER_GROUP_2 VALUES LESS THAN (20),
SUBPARTITION CUSTOMER_GROUP_3 VALUES LESS THAN (30),
SUBPARTITION CUSTOMER_GROUP_4 VALUES LESS THAN (40),
SUBPARTITION CUSTOMER_GROUP_5 VALUES LESS THAN (MAXVALUE)
)
(PARTITION VALUES LESS THAN ( TIMESTAMP '2015-01-01 00:00:00') );
CREATE TABLE MY_PART_TABLE
(
CUSTOMER_ID NUMBER NOT NULL,
THE_DATE TIMESTAMP(0) NOT NULL,
OTHER_COLUMNS NUMBER
)
PARTITION BY RANGE (THE_DATE) INTERVAL (INTERVAL '1' MONTH)
SUBPARTITION BY LIST (CUSTOMER_ID)
SUBPARTITION TEMPLATE (
SUBPARTITION CUSTOMER_1 VALUES (1),
SUBPARTITION CUSTOMER_2 VALUES (2),
SUBPARTITION CUSTOMER_3_to_6 VALUES (3,4,5,6),
SUBPARTITION CUSTOMER_7 VALUES (7)
)
(PARTITION VALUES LESS THAN ( TIMESTAMP '2015-01-01 00:00:00') );
Note that for the second solution the set of customer IDs is fixed. If you get new customers, you have to alter the table and modify the SUBPARTITION TEMPLATE accordingly.
Monthly partitions will be created automatically by Oracle whenever new values are inserted or updated.
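For instance, against the first table above, a single insert past the anchor boundary is enough (the values are made up for illustration):

```sql
-- Sketch: inserting a date beyond the 2015-01-01 anchor makes Oracle create
-- the matching monthly partition (and its template subpartitions) on the fly.
INSERT INTO MY_PART_TABLE (CUSTOMER_ID, THE_DATE, OTHER_COLUMNS)
VALUES (3, TIMESTAMP '2015-03-15 10:00:00', 42);

-- The system-named partition covering March 2015 can then be seen in
-- USER_TAB_PARTITIONS / USER_TAB_SUBPARTITIONS.
```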
This query runs very fast (<100 msec):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
If I add just a time filter, it takes too long (22 seconds!):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].Time > '2015-04-10'
AND [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
I tried adding an index on the [Time] column of the Sms table, but the optimizer doesn't seem to use it. I tried forcing it with WITH (INDEX (Ix_Sms_Time)), but to my surprise, it takes even more time (29 seconds!).
Here is the actual execution plan:
The execution plan is the same for both queries. The tables mentioned here have 5M to 8M rows (indexes are < 1% fragmented and stats are updated). I am using MS SQL Server 2008 R2 on a 16-core, 32 GB memory Windows 2008 R2 machine.
Does it help when you force the time filter to kick in only after the client filter has run?
Like in this example:
;WITH ClientData AS (
SELECT
[E2].[CompanyId]
,[E2].[Time]
,[E1].[Id]
,[E1].[Status]
FROM [dbo].[SplittedSms] AS [E1]
INNER JOIN [dbo].[Sms] AS [E2]
ON [E1].[SmsId] = [E2].[Id]
WHERE [E2].[CompanyId] = 4563
AND ([E1].[NotifiedToClient] IS NULL)
)
SELECT TOP 10
[CompanyId]
,[Id]
,[Status]
FROM ClientData
WHERE [Time] > '2015-04-10'
Create an index on Sms with the following Index Key Columns (in this order):
CompanyID
Time
You may or may not need to add Id as an Included Column.
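The suggested index would look something like this (index name and the INCLUDE choice are assumptions):

```sql
-- Sketch: composite index supporting the equality filter on CompanyId first,
-- then the range filter on Time; Id is included to cover the join/select.
CREATE INDEX IX_Sms_CompanyId_Time
    ON dbo.Sms (CompanyId, Time)
    INCLUDE (Id);  -- unnecessary if Id is already the clustered index key
```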
What datatype is your Time column?
If it's datetime, try converting your '2015-04-10' into the equivalent data type, so that it can use the index.
Declare @test datetime
Set @test = '2015-04-10'
Then modify your condition:
[Extent2].Time > @test
SQL Server implicitly casts to the matching data type if there is a data-type mismatch, and any function or cast operation on the column prevents index usage.
I'm on the same track as @JonTirjan; the index with just Time results in a lot of key lookups, so you should try at least the following:
create index xxx on Sms (Time, CompanyId) include (Id)
or
create index xxx on Sms (CompanyId, Time) include (Id)
If Id is your clustered index key, it's not needed in the INCLUDE clause. If a significant part of your data belongs to CompanyId 4563, it might be OK to have that as an included column too.
The percentages you see in the actual plan are just estimates based on row count assumptions, so they are sometimes totally wrong. Looking at the actual number of rows/executions plus STATISTICS IO output should give you an idea of what's actually happening.
Two things come to mind:
By adding an extra restriction, it becomes "harder" for the database to find the first 10 items that match your restrictions. Finding the first 10 rows among, let's say, 10,000 matching items (out of a total of 1 million) is easier than finding the first 10 rows among maybe 100 matching items (out of a total of 1 million).
The index is probably not being used because it was created on a datetime column, which is not very efficient if you are also storing the time of day. You might want to create a clustered index on the [Time] column (but then you would have to remove the clustered index that is now on the [CompanyId] column), or you could create a computed column that stores the date part of [Time], create an index on this computed column, and filter on it.
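The computed-column variant might look like this (column and index names are hypothetical):

```sql
-- Sketch: a persisted date-only computed column with its own index.
ALTER TABLE dbo.Sms
    ADD TimeDate AS CAST([Time] AS date) PERSISTED;

CREATE INDEX IX_Sms_TimeDate ON dbo.Sms (TimeDate);

-- The query would then filter on the computed column:
--   WHERE TimeDate > '2015-04-10'
```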
I found out that there was no index on the foreign key column (SmsId) of the SplittedSms table. I created one, and the second query now seems almost as fast as the first.
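Something along these lines (index name and the INCLUDE list are assumptions, chosen to cover the columns the query reads):

```sql
-- Sketch: index on the foreign key used by the join, covering the
-- filter and select columns to avoid key lookups.
CREATE INDEX IX_SplittedSms_SmsId
    ON dbo.SplittedSms (SmsId)
    INCLUDE (NotifiedToClient, Status);
```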
The execution plan now:
Thanks everyone for the effort.