PostgreSQL multi-layer partitioning

I have been using partitioning with a PostgreSQL database for a while. My database has grown quite a lot and scales nicely with partitioning. Unfortunately, I now seem to have hit another speed barrier and am trying to figure out ways to make the database even faster.
My basic setup is as follows:
I have one master table called database_data from which all the partitions inherit. I chose to have one partition per month, named like database_data_YYYY_MM, which works nicely.
By analyzing my data usage, I noticed that I mostly do insert operations on the table and only some updates. The updates, however, occur only on a certain kind of row: I have a column called channel_id (a FK to another table). The rows I update always have a channel_id out of a set of maybe 50 IDs, so this would be a great way of distinguishing the rows that are never updated from the ones that potentially are.
I figured it would speed up my setup further if I used the partitioning to have, per month, one table of insert-only data and one of potentially updated data, as my updates would have to check fewer rows each time.
I could of course use the "simple" partitioning I am using now and add another table for each month called database_data_YYYY_MM_update and add the special constraints to that and the database_data_YYYY_MM table in order for the query planner to distinguish between the tables.
I was, however, thinking that I sometimes have operations which operate on all data of a given month, regardless of whether it is updateable or not. In such a case I could JOIN the two tables, but there could be an easier way for such queries.
So now to my real question:
Is "two-layer" partitioning possible in PostgreSQL? What I mean by that is: instead of having two tables for each month inheriting from the master table, I would have only one table per month directly inheriting from the master table, e.g. database_data_YYYY_MM, and then two more tables inheriting from that table: one for the insert-only data, e.g. database_data_YYYY_MM_insert, and one for the updateable data, e.g. database_data_YYYY_MM_update.
Would this speed up the query planning at all? I would guess that it would be faster if the query planner could eliminate both child tables at once whenever the intermediate table is eliminated.
The obvious advantage here would be that I could operate on all data of one month by simply using the table database_data_YYYY_MM and for my updates use the child table directly.
Any drawbacks that I am not thinking of?
Thank you for your thoughts.
Edit 1:
I don't think a schema is strictly necessary to answer my question, but if it helps understanding, here is a sample schema:
CREATE TABLE database_data (
    id bigint PRIMARY KEY,
    channel_id bigint, -- This is a FK to another table
    timestamp TIMESTAMP WITH TIME ZONE,
    value DOUBLE PRECISION
);
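(For reference, the speed-up of this scheme relies on constraint exclusion; the default setting, constraint_exclusion = partition, is sufficient. A quick sanity check that a month-bounded query scans only the matching child, using the example partition from further below:)
EXPLAIN SELECT *
FROM database_data
WHERE timestamp >= TIMESTAMP WITH TIME ZONE '2015-11-01 00:00:00+00'
  AND timestamp <  TIMESTAMP WITH TIME ZONE '2015-12-01 00:00:00+00';
-- The plan should list only database_data_2015_11 (plus the empty master table).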
I have a trigger on the database_data table that generates the partitions on demand:
CREATE OR REPLACE FUNCTION function_insert_database_data() RETURNS TRIGGER AS $BODY$
DECLARE
    thistablename TEXT;
    thisyear INTEGER;
    thismonth INTEGER;
    nextmonth INTEGER;
    nextyear INTEGER;
BEGIN
    -- Determine year and month of the timestamp
    thismonth = extract(month from NEW.timestamp AT TIME ZONE 'UTC');
    thisyear  = extract(year  from NEW.timestamp AT TIME ZONE 'UTC');
    -- Determine the next month for the timespan in the check constraint
    nextyear  = thisyear;
    nextmonth = thismonth + 1;
    IF (nextmonth >= 13) THEN
        nextmonth = nextmonth - 12;
        nextyear  = nextyear + 1;
    END IF;
    -- Assemble the table name
    thistablename = 'database_data_' || thisyear || '_' || thismonth;
    -- Loop until the insert succeeds, to catch the case where another connection
    -- creates the table simultaneously; if so, we can simply retry the insert.
    LOOP
        -- Try to insert into the partition
        BEGIN
            EXECUTE 'INSERT INTO ' || quote_ident(thistablename) || ' SELECT ($1).*' USING NEW;
            -- RETURN NEW inserts the row into the master table as well, which allows
            -- insert statements to return values ("INSERT INTO ... RETURNING *").
            -- This requires another trigger to delete the row from the master afterwards.
            RETURN NEW;
        -- If the partition does not exist, create it
        EXCEPTION
            WHEN UNDEFINED_TABLE THEN
                BEGIN
                    -- Create the partition with a check constraint on the timestamp
                    EXECUTE 'CREATE TABLE ' || quote_ident(thistablename) || ' (
                        CHECK ( timestamp >= TIMESTAMP WITH TIME ZONE ''' || thisyear || '-' || thismonth || '-01 00:00:00+00''
                           AND timestamp <  TIMESTAMP WITH TIME ZONE ''' || nextyear || '-' || nextmonth || '-01 00:00:00+00'' ),
                        PRIMARY KEY (id)
                    ) INHERITS (database_data)';
                    -- Add any triggers and indices the partition might need here
                    -- Insert the new row into the new partition
                    EXECUTE 'INSERT INTO ' || quote_ident(thistablename) || ' SELECT ($1).*' USING NEW;
                    RETURN NEW;
                EXCEPTION WHEN DUPLICATE_TABLE THEN
                    -- Another connection seems to have created the table already. Simply loop again.
                END;
            -- Don't insert anything on other errors
            WHEN OTHERS THEN
                RETURN NULL;
        END;
    END LOOP;
END;
$BODY$
LANGUAGE plpgsql;
CREATE TRIGGER trigger_insert_database_data
    BEFORE INSERT ON database_data
    FOR EACH ROW EXECUTE PROCEDURE function_insert_database_data();
As for sample data: Let's assume we only have two channels: 1 and 2. 1 is insert only data and 2 is updateable.
My two layer approach would be something like:
Main table:
CREATE TABLE database_data (
    id bigint PRIMARY KEY,
    channel_id bigint, -- This is a FK to another table
    timestamp TIMESTAMP WITH TIME ZONE,
    value DOUBLE PRECISION
);
Intermediate table:
CREATE TABLE database_data_2015_11 (
    CHECK ( timestamp >= TIMESTAMP WITH TIME ZONE '2015-11-01 00:00:00+00'
       AND timestamp <  TIMESTAMP WITH TIME ZONE '2015-12-01 00:00:00+00' ),
    PRIMARY KEY (id)
) INHERITS (database_data);
Partitions:
CREATE TABLE database_data_2015_11_insert (
    CHECK (channel_id = 1),
    PRIMARY KEY (id)
) INHERITS (database_data_2015_11);
CREATE TABLE database_data_2015_11_update (
    CHECK (channel_id = 2),
    PRIMARY KEY (id)
) INHERITS (database_data_2015_11);
Of course I would then need another trigger on the intermediate table to create the child tables on demand.
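A minimal sketch of what I have in mind for that routing trigger (assuming, as in the sample data, that channel 2 is the only updateable channel; the real set of ~50 IDs would go into the IF condition):
CREATE OR REPLACE FUNCTION function_insert_database_data_2015_11() RETURNS TRIGGER AS $BODY$
BEGIN
    IF NEW.channel_id = 2 THEN
        INSERT INTO database_data_2015_11_update VALUES (NEW.*);
    ELSE
        INSERT INTO database_data_2015_11_insert VALUES (NEW.*);
    END IF;
    RETURN NULL; -- the row is stored in the child table only, not in the intermediate table
END;
$BODY$
LANGUAGE plpgsql;
CREATE TRIGGER trigger_insert_database_data_2015_11
    BEFORE INSERT ON database_data_2015_11
    FOR EACH ROW EXECUTE PROCEDURE function_insert_database_data_2015_11();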

It's a clever idea, but sadly it doesn't seem to work. If I have a parent table with 1000 direct children and I run a SELECT that should pull from just one child, EXPLAIN ANALYZE gives me a planning time of around 16 ms. On the other hand, if I have just 10 direct children, each of which has 10 children, which in turn each have 10 children, I get a planning time of about 29 ms. I was surprised, as I really thought it would work!
Here is some Ruby code I used to generate my tables:
0.upto(999) do |i|
  if i % 100 == 0
    min_group_id = i
    max_group_id = min_group_id + 100
    puts "CREATE TABLE datapoints_#{i}c (check (group_id > #{min_group_id} and group_id <= #{max_group_id})) inherits (datapoints);"
  end
  if i % 10 == 0
    min_group_id = i
    max_group_id = min_group_id + 10
    puts "CREATE TABLE datapoints_#{i}x (check (group_id > #{min_group_id} and group_id <= #{max_group_id})) inherits (datapoints_#{i / 100 * 100}c);"
  end
  puts "CREATE TABLE datapoints_#{i + 1} (check (group_id = #{i + 1})) inherits (datapoints_#{i / 10 * 10}x);"
end
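For anyone who wants to reproduce the measurement: after loading the generated DDL, the planning time is reported directly by EXPLAIN ANALYZE (PostgreSQL 9.4 and later), e.g.:
EXPLAIN ANALYZE SELECT * FROM datapoints WHERE group_id = 501;
-- The "Planning time:" line at the bottom of the output is the number compared above.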

Related

SQL Server trigger: limit record insert

I'm doing a database for a university assignment. At a certain point we have to create a trigger/function that can limit the insertion of records, taking different foreign keys into account (we're on SQL Server Management Studio 17). For example, we can have 100 records in total, but only 50 records for the same foreign key.
We have this constraint:
ALTER TABLE [dbo].[DiaFerias] WITH CHECK
    ADD CONSTRAINT [Verificar22DiasFerias]
    CHECK (([dbo].[verificarDiasFerias]((22)) = 'True'))
With the help of this function:
ALTER FUNCTION [dbo].[verificarDiasFerias] (@contagem INT)
RETURNS VARCHAR(5)
AS
BEGIN
    IF EXISTS (SELECT DISTINCT idFuncionario, idDiaFerias
               FROM DiaFerias
               GROUP BY idFuncionario, idDiaFerias
               HAVING COUNT(*) <= @contagem)
        RETURN 'True'
    RETURN 'False'
END
Would I use a trigger here? Reluctantly, yes (See Enforce maximum number of child rows for how to do it without triggers, in a way I'd never recommend). Would I use that function? No.
I'd create an indexed view:
CREATE VIEW dbo.DiaFerias_Counts
WITH SCHEMABINDING
AS
    SELECT idFuncionario, idDiaFerias, COUNT_BIG(*) AS Cnt
    FROM dbo.DiaFerias
    GROUP BY idFuncionario, idDiaFerias
GO
CREATE UNIQUE CLUSTERED INDEX PK_DiaFerias_Counts ON
    dbo.DiaFerias_Counts (idFuncionario, idDiaFerias)
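One caveat: outside Enterprise edition, SQL Server only uses the materialized data when the view is referenced with the NOEXPAND hint, so queries (including the trigger below) may want to read it like this:
SELECT dfc.idFuncionario, dfc.idDiaFerias, dfc.Cnt
FROM dbo.DiaFerias_Counts AS dfc WITH (NOEXPAND)
WHERE dfc.Cnt > 22;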
Why do this? So that SQL Server maintains these counts for us automatically, so we don't have to write a wide-ranging query in the triggers. We can now write the trigger, something like:
CREATE TRIGGER T_DiaFerias
ON dbo.DiaFerias
AFTER INSERT, UPDATE
AS
    SET NOCOUNT ON;
    IF EXISTS (
        SELECT *
        FROM dbo.DiaFerias_Counts dfc
        WHERE dfc.Cnt > 22
        AND (
            EXISTS (SELECT * FROM inserted i
                    WHERE i.idFuncionario = dfc.idFuncionario AND i.idDiaFerias = dfc.idDiaFerias)
            OR EXISTS (SELECT * FROM deleted d
                       WHERE d.idFuncionario = dfc.idFuncionario AND d.idDiaFerias = dfc.idDiaFerias)
        )
    )
    BEGIN
        RAISERROR('Constraint violation', 16, 1)
    END
Hopefully you can see how it's meant to work: we only want to query counts for items that may have been affected by whatever statement fired the trigger, so we use inserted and deleted to limit our search.
And we reject the change if any count is greater than 22, unlike your function, which only starts rejecting rows once every count is greater than 22.

How can I close an auction when the closing time is reached?

I am developing a real-time auction site for a school project. We can't make any changes to the design of the database.
The 'Item' table has a column for the expiration date (the day the auction expires) and the expiration time (the exact time at which the auction expires). It also has a column that indicates whether the auction is opened or closed. This [AuctionClosed?] column needs to be updated when the expiration date and time are reached, which has to happen in real-time.
We're using an SQL Server database and the website runs on PHP7. The only possible solution I've found is to run a job every second, but this is too much overhead.
This is the function I want to use to check the column:
CREATE FUNCTION dbo.fn_isAuctionClosed (@Item BIGINT)
RETURNS BIT
AS
BEGIN
    DECLARE @expirationDay  DATE = (SELECT expirationDate FROM Item WHERE itemId = @Item)
    DECLARE @expirationTime TIME = (SELECT expirationTime FROM Item WHERE itemId = @Item)
    IF
        CAST(GETDATE() AS DATE) = @expirationDay AND CAST(GETDATE() AS TIME) >= @expirationTime
        OR
        CAST(GETDATE() AS DATE) > @expirationDay
        RETURN 1
    RETURN 0
END
And this is the procedure that updates the column:
CREATE PROCEDURE updateAuctionClosed @Item BIGINT
AS
    UPDATE Item
    SET [AuctionClosed?] = dbo.fn_isAuctionClosed(@Item)
    WHERE itemId = @Item
To be more specific, what you really want here is a computed column. Like I said in the comments, because the column relies on the current date and time, it won't be deterministic. This means it can't be PERSISTED: it will be calculated every time the column is referenced (a PERSISTED column actually has its value stored; it is recalculated whenever the row is affected in some way, then stored again). Even so, it can be defined as follows:
ALTER TABLE Item DROP COLUMN [AuctionClosed?]; -- You can't alter a column into a computed column, so we have to DROP it first
ALTER TABLE Item ADD [AuctionClosed?] AS CASE WHEN CONVERT(datetime, expirationDate) + CONVERT(datetime, expirationTime) <= GETDATE() THEN 1 ELSE 0 END;
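Because the column is computed on reference, every query sees the current state with no background job at all; a quick sketch:
SELECT itemId
FROM Item
WHERE [AuctionClosed?] = 0; -- auctions still open right now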
On a side note, I recommend against special characters in object names. Stick to alphanumeric characters only and (if you must) underscores (_), as these don't force the identifier to be delimited.

Cursor-based approach vs set-based approach

I need to optimize a slow-running stored procedure by moving it from a cursor-based approach to a set-based approach.
In principle, I have to compare records from a transient table (up to 300 records) against records from a "master table" (approx. half a million records and steadily growing). The matching is performed by comparing 20 varchar(11) columns of the two records. If at least 6 of these columns match between the two records (i.e. same data), that is considered a "sufficient match", and a record is to be created in a match table storing the IDs of the transient record and the master record, the total number of matches and the total number of mismatches.
Note that the number of mismatches is not equal to the balance of 20 minus the number-of-matches. That's because if any of the columns in either of the two records contains a null, it is not counted as a match nor a mismatch; it is simply ignored. Thus the need to capture the two counts (business requirement).
The current implementation uses an outer FAST_FORWARD cursor for the master table and an inner FAST_FORWARD cursor for the transient table. Within the inner cursor it has the following simple comparison logic applied to the 20 columns:
-- (pseudocode: the data1 values have been FETCHed from the two cursors into local variables)
IF @newResults_data1 IS NOT NULL AND @results_data1 IS NOT NULL
BEGIN
    IF @newResults_data1 = @results_data1
        SET @matchCount = @matchCount + 1
    ELSE
        SET @mismatchCount = @mismatchCount + 1
END
Then, if the total number of matching columns (i.e. @matchCount) is >= 6, a "match record" is written to a "match table" capturing the primary keys of the two records and the number of matches and mismatches.
What I'm hoping to achieve: rather than looping through the two nested cursors and processing one record at a time, use a set-based implementation. One simple solution I could think of would be to do an:
INSERT INTO MatchingResults (ResultID, NewResultID, MatchCount, MismatchCount)
SELECT (...) WHERE (...)
...and put the whole matching enchilada in the SELECT statement. But this is the difficult part... Would anyone be able to give me some pointers here? Or suggest a better-performing solution? Many thanks!
Updated with table structures:
--
-- Transient table:
--
CREATE TABLE NewResults
(
    NewResultID int identity(1,1),
    Data1 varchar(11),
    Data2 varchar(11),
    ...
    Data20 varchar(11),
    SampleDate datetime
)
--
-- Master table:
--
CREATE TABLE Results
(
    ResultID int identity(1,1),
    Data1 varchar(11),
    Data2 varchar(11),
    ...
    Data20 varchar(11),
    SampleDate datetime
)
--
-- Match table:
--
CREATE TABLE MatchingResults
(
    ResultID int,
    NewResultID int,
    MatchCount int,
    MismatchCount int
)
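One possible set-based shape for this (a sketch only; Data3 through Data19 are elided here just as in the DDL above, and performance at 300 x 500k pairs would need testing): unpivot the 20 column pairs of each candidate pair with a VALUES constructor, count matches and mismatches while skipping NULLs on either side, and keep pairs with at least 6 matches.
INSERT INTO MatchingResults (ResultID, NewResultID, MatchCount, MismatchCount)
SELECT r.ResultID, n.NewResultID, p.MatchCount, p.MismatchCount
FROM NewResults AS n
CROSS JOIN Results AS r
CROSS APPLY (
    SELECT
        SUM(CASE WHEN v.NewData =  v.OldData THEN 1 ELSE 0 END) AS MatchCount,
        SUM(CASE WHEN v.NewData <> v.OldData THEN 1 ELSE 0 END) AS MismatchCount
    FROM (VALUES
        (n.Data1,  r.Data1),
        (n.Data2,  r.Data2),
        -- ... (n.Data3, r.Data3) through (n.Data19, r.Data19) ...
        (n.Data20, r.Data20)
    ) AS v(NewData, OldData)
    -- NULL on either side counts as neither a match nor a mismatch
    WHERE v.NewData IS NOT NULL AND v.OldData IS NOT NULL
) AS p
WHERE p.MatchCount >= 6;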

SQL Server constraints for date ranges

I am trying to constrain a SQL Server database by a Start Date and End Date such that I can never double-book a resource (i.e. no overlapping or duplicate reservations).
Assume my resources are numbered, so that the table looks like:
ResourceId, StartDate, EndDate, Status
So let's say I have resource #1. I want to make sure that I cannot have a reservation for 1/8/2017 thru 1/16/2017 and a separate reservation for 1/10/2017 thru 1/18/2017 for the same resource.
One more complication: a StartDate for a resource can be the same as the EndDate for the resource. So 1/8/2017 thru 1/16/2017 and 1/16/2017 thru 1/20/2017 is OK (i.e., one person can check in on the same day another person checks out).
Furthermore, the Status field indicates whether the booking of the resource is Active or Cancelled, so we can ignore all cancelled reservations.
We have protected against these overlapping or double-booked reservations in code (stored procs and C#) when saving, but we are hoping to add an extra layer of protection by adding a DB constraint.
Is this possible in SQL Server?
Thanks in advance.
You can use a CHECK constraint to make sure StartDate is on or before EndDate easily enough:
CONSTRAINT [CK_Tablename_ValidDates] CHECK ([EndDate] >= [StartDate])
A constraint won't help with preventing an overlapping date range. You can instead use a TRIGGER to enforce this by creating a FOR INSERT, UPDATE trigger that rolls back the transaction if it detects a duplicate:
CREATE TRIGGER [TR_Tablename_NoOverlappingDates] ON [MyTable] FOR INSERT, UPDATE AS
    IF EXISTS (SELECT * FROM inserted INNER JOIN [MyTable] ON blah blah blah ...) BEGIN
        ROLLBACK TRANSACTION;
        RAISERROR('hey, no overlapping date ranges here, buddy', 16, 1);
        RETURN;
    END
Another option is to create an indexed view that finds duplicates, with a unique constraint on that view that is violated if more than one record exists. This is usually accomplished with a dummy table that has two rows, cartesian-joined to an aggregate view that selects the duplicate id; one record with a duplicate would then return two rows in the view with the same fake id value, violating its unique index.
I've done both, I like the trigger approach better.
Drawing from this answer here: Date range overlapping check constraint.
First, check to make sure there are not existing overlaps:
select *
from dbo.Reservation as r
where exists (
    select 1
    from dbo.Reservation as i
    where i.ResourceId = r.ResourceId
      and i.ReservationId != r.ReservationId
      and isnull(i.EndDate, '20991231') > r.StartDate
      and isnull(r.EndDate, '20991231') > i.StartDate
);
go
If it is all clear, then create your function.
There are a couple of different ways to write the function, e.g. we could skip the StartDate and EndDate and use something based only on ReservationId like the query above, but I will use this as the example:
create function dbo.udf_chk_Overlapping_StartDate_EndDate (
    @ResourceId int
    , @StartDate date
    , @EndDate date
) returns bit as
begin;
    declare @r bit = 1;
    if not exists (
        select 1
        from dbo.Reservation as r
        where r.ResourceId = @ResourceId
          and isnull(@EndDate, '20991231') > r.StartDate
          and isnull(r.EndDate, '20991231') > @StartDate
          and r.[Status] = 'Active'
        group by r.ResourceId
        having count(*) > 1
    )
        set @r = 0;
    return @r;
end;
go
Then add your constraint:
alter table dbo.Reservation
add constraint chk_Overlapping_StartDate_EndDate
check (dbo.udf_chk_Overlapping_StartDate_EndDate(ResourceId,StartDate,EndDate)=0);
go
Last: Test it.
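A quick smoke test might look like this (assuming the four columns from the question plus an identity ReservationId):
insert into dbo.Reservation (ResourceId, StartDate, EndDate, [Status])
values (1, '20170108', '20170116', 'Active'); -- succeeds
insert into dbo.Reservation (ResourceId, StartDate, EndDate, [Status])
values (1, '20170116', '20170120', 'Active'); -- succeeds: check-in on checkout day is allowed
insert into dbo.Reservation (ResourceId, StartDate, EndDate, [Status])
values (1, '20170110', '20170118', 'Active'); -- should fail the check constraint (overlap)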

Generating Unique Random Numbers Efficiently

We are using the technique outlined here to generate random record IDs without collisions. In short, we create a randomly-ordered table of every possible ID, and mark each record as 'Taken' as it is used.
I use the following Stored Procedure to obtain an ID:
ALTER PROCEDURE spc_GetId @retVal BIGINT OUTPUT
AS
    DECLARE @curUpdate TABLE (Id BIGINT);
    SET NOCOUNT ON;
    UPDATE IdMasterList SET Taken = 1
        OUTPUT DELETED.Id INTO @curUpdate
        WHERE ID = (SELECT TOP 1 ID FROM IdMasterList WITH (INDEX(IX_Taken)) WHERE Taken IS NULL ORDER BY SeqNo);
    SELECT TOP 1 @retVal = Id FROM @curUpdate;
    RETURN;
The retrieval of the ID must be an atomic operation, as simultaneous inserts are possible.
For large inserts (10+ million), the process is quite slow, as I must pass through the table to be inserted via a cursor.
The IdMasterList has a schema:
SeqNo (BIGINT, NOT NULL) (PK) -- sequence of ordered numbers
Id (BIGINT) -- sequence of random numbers
Taken (BIT, NULL) -- 1 if taken, NULL if not
The IX_Taken index is:
CREATE NONCLUSTERED INDEX IX_Taken ON IdMasterList (Taken ASC)
I generally populate a table with Ids in this manner:
DECLARE @recNo BIGINT;
DECLARE @newId BIGINT;
DECLARE newAdds CURSOR FOR SELECT recNo FROM Adds;
OPEN newAdds;
FETCH NEXT FROM newAdds INTO @recNo;
WHILE @@FETCH_STATUS = 0 BEGIN
    EXEC spc_GetId @newId OUTPUT;
    UPDATE Adds SET id = @newId WHERE recNo = @recNo;
    FETCH NEXT FROM newAdds INTO @recNo;
END;
CLOSE newAdds;
DEALLOCATE newAdds;
Questions:
Is there any way I can improve the SP to extract Ids faster?
Would a conditional index improve performance (I've yet to test, as IdMasterList is very big)?
Is there a better way to populate a table with these Ids?
As with most things in SQL Server, if you are using cursors, you are doing it wrong.
Since you are using SQL Server 2012, you can use a SEQUENCE to keep track of what random value you already used and effectively replace the Taken column.
CREATE SEQUENCE SeqNoSequence
AS bigint
START WITH 1 -- Start with the first SeqNo that is not taken yet
CACHE 1000; -- Increase the cache size if you regularly need large blocks
Usage:
CREATE TABLE #tmp
(
    recNo bigint,
    SeqNo bigint
)
INSERT INTO #tmp (recNo, SeqNo)
SELECT recNo,
       NEXT VALUE FOR SeqNoSequence
FROM Adds
UPDATE a
SET id = m.Id
FROM Adds a
INNER JOIN #tmp tmp ON a.recNo = tmp.recNo
INNER JOIN IdMasterList m ON tmp.SeqNo = m.SeqNo
SEQUENCE is atomic. Subsequent calls to NEXT VALUE FOR SeqNoSequence are guaranteed to return unique values, even for parallel processes. Note that there can be gaps in SeqNo, but it's a very small trade off for the huge speed increase.
Put a PK index of BIGINT on each table:
insert into [user] (name)
values (...);
update u
set u.ID = id.ID
from [user] u
left join id on u.PK = id.PK
where u.ID is null;
Or one row at a time:
insert into [user] (name) values ('justsaynotocursor');
set @PK = SCOPE_IDENTITY();
update [user] set ID = (select ID from id where PK = @PK);
A few ideas that came to my mind:
Try removing the TOP, the inner SELECT etc. to see whether it improves the performance of the ID fetching (look at statistics io & the query plan):
UPDATE TOP (1) IdMasterList
SET @retVal = Id, Taken = 1
WHERE Taken IS NULL
Change the index to be a filtered index, since I assume you don't need to fetch numbers that are already taken. If I remember correctly, you can't do this for NULL values, so you would need to change Taken to be 0/1, as sketched below.
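That filtered index could look like this (a sketch; assumes Taken is made NOT NULL with 0 = free, 1 = taken, and the INCLUDE keeps the lookup covered):
CREATE NONCLUSTERED INDEX IX_Taken_Free
    ON IdMasterList (SeqNo)
    INCLUDE (Id)
    WHERE Taken = 0;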
What actually is your problem? Fetching single IDs or 10+ million IDs? Is the problem CPU / I/O etc. caused by the cursor & ID fetching logic, or are the parallel processes being blocked by other processes?
Use a sequence object to get the SeqNo, and then fetch the Id from IdMasterList using the value it returns. This could work if you don't have gaps in the IdMasterList sequences.
Using the READPAST hint could help with blocking; for CPU / I/O issues, you should try to optimize the SQL.
If the cause is purely the table being a hotspot, and no other easy solutions seem to help, split it into several tables and use some kind of simple logic (even @@spid, rand() or something similar) to decide from which table the ID should be fetched. You would need more checking to ensure all tables still have free numbers, but it shouldn't be that bad.
Create different procedures (or even tables) to handle fetching of single ID, hundreds of IDs and millions of IDs.
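Combining the first idea with the READPAST hint, the fetch inside the procedure could look something like this sketch (no ORDER BY, so strict SeqNo order is traded for speed; assumes Taken reworked to 0/1 as suggested above):
DECLARE @retVal BIGINT;
UPDATE TOP (1) IdMasterList WITH (READPAST)
SET @retVal = Id, Taken = 1
WHERE Taken = 0;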
