Optimal primary key on this ClickHouse schema for aggregation - database

I have the following ClickHouse schema; the table uses the MergeTree engine:
(
hotel String,
staff_member String,
task_number Float64,
date DateTime
)
PRIMARY KEY (hotel, date)
ORDER BY (hotel, date)
My aggregation is as follows:
SELECT
staff_member,
sum(task_number)
FROM ...
WHERE
hotel = {hotel}
AND date >= {first_date}
AND date <= {top_date}
GROUP BY staff_member
Basically, I'm summing the number of tasks per staff member over a period of time, but the aggregation is slow. I suspect the primary key is off and needs reworking.
The first thing that comes to mind is changing the key to (hotel, staff_member, date), since I'm grouping by staff_member.
I'm thankful for any help!
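The proposed key change can be tried side by side with the existing table before committing to it. The following is a minimal sketch, not a recommendation: the table name hotel_tasks_by_staff and the literal filter values are hypothetical (the question elides the real table name). Whether moving staff_member ahead of date helps depends on the data, since the query filters on hotel and date but not on staff_member. EXPLAIN indexes = 1 reports how many parts and granules the primary key lets ClickHouse skip, which can be compared with the same query against the current (hotel, date) key.

```sql
-- Hypothetical copy of the table using the proposed sorting key.
-- When PRIMARY KEY is omitted, MergeTree uses the ORDER BY expression for it.
CREATE TABLE hotel_tasks_by_staff
(
    hotel String,
    staff_member String,
    task_number Float64,
    date DateTime
)
ENGINE = MergeTree
ORDER BY (hotel, staff_member, date);

-- Show which parts and granules the primary index selects for the query.
-- The filter values below are placeholders.
EXPLAIN indexes = 1
SELECT
    staff_member,
    sum(task_number)
FROM hotel_tasks_by_staff
WHERE hotel = 'Example Hotel'
  AND date >= '2024-01-01 00:00:00'
  AND date <= '2024-01-31 23:59:59'
GROUP BY staff_member;
```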

Related

How would one create a DATE column that is updated via a DATETIME column

Consider the following:
CREATE TABLE mytable
(
[id] INT NOT NULL,
[foobar] VARCHAR(25) NULL,
[created_on] DATETIME NOT NULL
);
SELECT *
FROM mytable
WHERE CAST(created_on AS DATE) = '2019-01-01';
I have a lot of queries like this, where I need to store the full date and time for audit (and sorting) purposes, but most queries only care about the date portion when it comes to searching.
In order to improve performance, I was considering adding a sister column that stores the value as a DATE, and then update it via triggers; but before I go down that rabbit hole, I wanted to know if there is a better way to solve this issue. Is there some mechanism in SQL Server that offers a better solution to this issue?
I am currently stuck on SQL Server 2008, but I am open to solutions that use newer versions
My preference would be to just write a sargable
WHERE created_on >= '2019-01-01' and created_on < '2019-01-02';
The expression
CAST(created_on AS DATE) = '2019-01-01';
is in fact mostly sargable but somewhat suboptimal ...
... and splitting it out into a separate indexed column can help in other cases, such as GROUP BY date.
If you decide you do need a separate column you can create a computed column and index that.
This is preferable to triggers: it has less performance overhead, and it allows SQL Server to match both the column name and the original expression. (An index on a column populated by a trigger won't be matched to a query containing CAST(created_on AS DATE).)
CREATE TABLE mytable
(
[id] INT NOT NULL,
[foobar] VARCHAR(25) NULL,
[created_on] DATETIME NOT NULL,
[created_on_date] AS CAST(created_on AS DATE)
);
CREATE INDEX ix_created_on_date
ON mytable(created_on_date)
INCLUDE (foobar, id, created_on);
SELECT foobar,
id,
created_on
FROM mytable
WHERE CAST(created_on AS DATE) = '2019-01-01';

Indexed view - SUM function that references a nullable expression

I have a table of time entries with the following columns:
Id (PK)
Date
EmployeeId
State (state of the entry New, Approved, etc.)
Quantity
And I would like to create an indexed view which groups time entries by day and employee. So I used:
CREATE VIEW dbo.Test1
WITH SCHEMABINDING
AS
SELECT
Date, EmployeeId, SUM(Quantity), SUM(CASE WHEN State = 1 THEN Quantity ELSE NULL END) AS QuantityApproved
FROM
TimeEntries
GROUP BY
EmployeeId, Date
CREATE UNIQUE CLUSTERED INDEX IDX_V1
ON dbo.Test1 (EmployeeId, Date);
GO
But when I try to make it an indexed view an error occurs:
Cannot create the clustered index "IDX_V1" on view "dbo.Test1" because the view references an unknown value (SUM aggregate of nullable expression). Consider referencing only non-nullable values in SUM. ISNULL() may be useful for this.
Obviously, using ISNULL would help in the case of the QuantityApproved column. But this is not a solution for me, as 0 may also indicate that there are 2 records (Quantity=-1 and Quantity=1) on the same day.
I could also use an auxiliary column holding the ABS value for this case, but having NULL there is very convenient, as I would not need to handle anything else.
Is there any other way to overcome this?
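One possible way around the ambiguity of 0 is to keep the ISNULL wrapper but store a separate count of approved rows, so the NULL can be restored when the view is read. This is only a sketch under stated assumptions: the table is assumed to be dbo.TimeEntries with the columns listed above, State = 1 is assumed to mean "approved", and ApprovedCount and RowCnt are hypothetical column names. Wrapping the plain SUM(Quantity) in ISNULL is only needed if Quantity is nullable.

```sql
CREATE VIEW dbo.Test1
WITH SCHEMABINDING
AS
SELECT
    [Date],
    EmployeeId,
    SUM(ISNULL(Quantity, 0)) AS Quantity,
    -- Non-nullable expression: unapproved rows contribute 0.
    SUM(ISNULL(CASE WHEN [State] = 1 THEN Quantity END, 0)) AS QuantityApproved,
    -- Hypothetical helper: number of approved rows in the group.
    SUM(CASE WHEN [State] = 1 THEN 1 ELSE 0 END) AS ApprovedCount,
    -- An indexed view with GROUP BY must include COUNT_BIG(*).
    COUNT_BIG(*) AS RowCnt
FROM dbo.TimeEntries          -- schema-bound views need two-part names
GROUP BY EmployeeId, [Date];
GO
CREATE UNIQUE CLUSTERED INDEX IDX_V1
    ON dbo.Test1 (EmployeeId, [Date]);
GO
-- Restore the NULL when a day has no approved rows at all.
SELECT
    [Date],
    EmployeeId,
    Quantity,
    CASE WHEN ApprovedCount = 0 THEN NULL ELSE QuantityApproved END AS QuantityApproved
FROM dbo.Test1 WITH (NOEXPAND);
```

With this shape, a QuantityApproved of 0 combined with ApprovedCount > 0 means approved rows that cancel out, while ApprovedCount = 0 means there were no approved rows.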

Query for records with overlapping dates in cassandra

Given a table with startDate and endDate columns how do I query for records where the periods overlap. The dates currently use DateField type. Would changing the type to DateRangeField help with this query? We are using Cassandra with Solr.
I am just sharing an event data model for records where the periods overlap:
CREATE TABLE event (
start_date date,
id timeuuid,
end_date date,
PRIMARY KEY ((start_date), id))
SELECT * FROM event WHERE start_date = date;
I think that if you create your data model like this and query it like this, your problem will be solved.

Proper way to index date & time columns

I have a table with the following structure:
CREATE TABLE MyTable (
ID int identity,
Whatever varchar(100),
MyTime time(2) NOT NULL,
MyDate date NOT NULL,
MyDateTime AS (DATEADD(DAY, DATEDIFF(DAY, '19000101', [MyDate]),
CAST([MyTime] AS DATETIME2(2))))
)
The computed column adds date and time into a single datetime2 field.
Most queries against the table have one or more of the following clauses:
... WHERE MyDate < @filter1 and MyDate > @filter2
... ORDER BY MyDate, MyTime
... ORDER BY MyDateTime
In a nutshell, date is usually used for filtering, and full datetime is used for sorting.
Now for questions:
What is the best way to set indices on those 3 date-time columns? 2 separate on date and time or maybe 1 on date and 1 on composite datetime, or something else? Quite a lot of inserts and updates occur on this table, and I'd like to avoid over-indexing.
As I wrote this question, I noticed the long and rather ugly computed column definition. I picked it up from somewhere a while ago and forgot to investigate whether there's a simpler way of doing it. Is there an easier way of combining a date and a time into a datetime2? Simple addition does not work, and I'm not sure whether I should avoid casting to varchar, combining and casting back.
Unfortunately, you didn't mention what version of SQL Server you're using ....
But if you're on SQL Server 2008 or newer, you should turn this around:
your table should have
MyDateTime DATETIME
and then define the "only date" column as
MyDate AS CAST(MyDateTime AS DATE) PERSISTED
Since you make it persisted, it's stored alongside the table data (and not calculated every time you query it), and you can easily index it.
Same applies to the MyTime column.
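Put together, the turned-around table could look like the sketch below. It reuses the column names from the question; the index name IX_MyTable_MyDate is a hypothetical choice, and the sketch assumes SQL Server 2008 or newer, where the date and time types are available.

```sql
CREATE TABLE MyTable (
    ID int IDENTITY,
    Whatever varchar(100),
    MyDateTime DATETIME NOT NULL,
    -- Date and time parts derived from the single stored datetime value.
    MyDate AS CAST(MyDateTime AS DATE) PERSISTED,
    MyTime AS CAST(MyDateTime AS TIME(2)) PERSISTED
);

-- The persisted date column can be indexed like any ordinary column.
CREATE INDEX IX_MyTable_MyDate ON MyTable (MyDate);
```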
Having date and time in two separate columns may seem peculiar but if you have queries that use only the date (and/or especially only the time part), I think it's a valid decision. You can create an index on date only or on time or on (date, whatever), etc.
What I don't understand is why you also have the computed datetime column. There's no reason to store this value, too. It can easily be calculated when needed.
And if you need to order by datetime, you can use ORDER BY MyDate, MyTime. With an index on (MyDate, MyTime) this should be ok. Range datetime queries would also be using that index.
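A minimal sketch of that composite index and a query that can use it for both the range filter and the sort. The index name and the variable values are hypothetical.

```sql
-- One index serves the date-range filter and the (date, time) ordering.
CREATE INDEX IX_MyTable_MyDate_MyTime ON MyTable (MyDate, MyTime);

DECLARE @from date = '2019-01-01';  -- placeholder range start
DECLARE @to   date = '2019-02-01';  -- placeholder range end (exclusive)

SELECT ID, MyDate, MyTime
FROM MyTable
WHERE MyDate >= @from
  AND MyDate < @to
ORDER BY MyDate, MyTime;
```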
The answer isn't in your indexing, it's in your querying.
A single DateTime field should be used, or even SmallDateTime if that provides the range of dates and time resolution required by your application.
Index that column, then use queries like this:
SELECT * FROM MyTable WHERE
MyDate >= @startfilterdate
AND MyDate < DATEADD(d, 1, @endfilterdate);
By using < on the end filter, it only includes results from before midnight at the start of the day after the user-selected "end date". This is simpler and more accurate than adding 23:59:59, especially since stored times can include fractions of a second between 23:59:59 and 00:00:00.
Using persisted columns and indexes on them is a waste of server resources.

SQL Server index - very large table with where clause against a very small range of values - do I need an index for the where clause?

I am designing a database with a single table for a special scenario I need to implement a solution for. The table will have several hundred million rows after a short time, but each row will be fairly compact. Even when there are a lot of rows, I need insert, update and select speeds to be nice and fast, so I need to choose the best indexes for the job.
My table looks like this:
create table dbo.Domain
(
Name varchar(255) not null,
MetricType smallint not null, -- very small range of values, maybe 10-20 at most
Priority smallint not null, -- extremely small range of values, generally 1-4
DateToProcess datetime not null,
DateProcessed datetime null,
primary key(Name, MetricType)
);
A select query will look like this:
select Name from Domain
where MetricType = @metricType
and DateProcessed is null
and DateToProcess < GETUTCDATE()
order by Priority desc, DateToProcess asc
The first type of update will look like this:
merge into Domain as target
using @myTablePrm as source
on source.Name = target.Name
and source.MetricType = target.MetricType
when matched then
update set
DateToProcess = source.DateToProcess,
Priority = source.Priority,
DateProcessed = case -- set to null if DateToProcess is in the future
when source.DateToProcess < target.DateProcessed then target.DateProcessed
else null end
when not matched then
insert (Name, MetricType, Priority, DateToProcess)
values (source.Name, source.MetricType, source.Priority, source.DateToProcess);
The second type of update will look like this:
update Domain
set DateProcessed = source.DateProcessed
from @myTablePrm source
where Domain.Name = source.Name and Domain.MetricType = @metricType
Are these the best indexes for optimal insert, update and select speed?
-- for the order by clause in the select query
create index IX_Domain_PriorityQueue
on Domain(Priority desc, DateToProcess asc)
where DateProcessed is null;
-- for the where clause in the select query
create index IX_Domain_MetricType
on Domain(MetricType asc);
Observations:
Your updates should use the PK
Why not use tinyint (range 0-255) to make the rows even narrower?
Do you need datetime? Can you use smalldatetime?
Ideas:
Your SELECT query doesn't have an index to cover it. You need one on (DateToProcess, MetricType, Priority DESC) INCLUDE (Name) WHERE DateProcessed IS NULL; you'll have to experiment with key column order to find the best one.
You could extend that index into a filtered index per MetricType too (keeping the DateProcessed IS NULL filter). I'd do this after the other one, once I have millions of rows to test with.
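Written out as DDL, the covering filtered index described above might look like this. The index name is hypothetical, and the key column order is only a starting point to experiment with, as noted.

```sql
-- Filtered, covering index for the "unprocessed work" SELECT query.
CREATE INDEX IX_Domain_Unprocessed
    ON dbo.Domain (DateToProcess, MetricType, Priority DESC)
    INCLUDE (Name)
    WHERE DateProcessed IS NULL;
```

Because Name is part of the clustered primary key it is carried in the index anyway, so listing it in INCLUDE mainly documents the intent.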
I suspect that your best performance will come from having no indexes on Priority and MetricType. The cardinality is likely too low for the indexes to do much good.
An index on DateToProcess will almost certainly help, as there is likely to be high cardinality in that column and it is used in both the WHERE and ORDER BY clauses. I would start with that.
Whether an index on DateProcessed will help is up for debate. That depends on what percentage of NULL values you expect for this column. Your best bet, as usual, is to examine the query plan with some real data.
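One way to compare index candidates against real data is to run the SELECT with I/O and timing statistics switched on and check the logical reads before and after each index change. This is a sketch: the @metricType value is a placeholder.

```sql
DECLARE @metricType smallint = 1;  -- placeholder value

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

select Name from dbo.Domain
where MetricType = @metricType
and DateProcessed is null
and DateToProcess < GETUTCDATE()
order by Priority desc, DateToProcess asc;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The logical reads reported in the Messages output, together with the actual execution plan, show whether a given index is being used at all.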
In the table schema, MetricType is one of the two primary key columns, so it should definitely be indexed along with the Name column. As the Priority and DateToProcess fields are used in the query, it can't hurt to index them as well, but I don't recommend the WHERE DateProcessed IS NULL clause you have on that index. Indexing just a subset of the data is not a good idea; remove the filter and index the whole of both columns.

Resources