ClickHouse is not inserting new data into a materialized view

I have created a materialized view using this query
CREATE MATERIALIZED VIEW db.top_ids_mv (
    `date_time` DateTime,
    `id` String,
    `total` UInt64
)
ENGINE = SummingMergeTree
ORDER BY (date_time, id)
SETTINGS index_granularity = 8192
POPULATE AS
SELECT
    toDateTime((intDiv(toUInt32(date_time), 60 * 60) * 60) * 60) AS date_time,
    id AS id,
    count(*) AS count
FROM db.table
WHERE type = 'user'
GROUP BY date_time, id
My table contains almost 18 billion records. I backfilled the old data using POPULATE, but newly inserted data is not making it into this materialized view. I have created many other views and they work fine; only this one is causing an issue.
This is what I am receiving in the logs:
2021.09.23 19:54:54.424457 [ 137949 ] {5b0f3c32-2900-4ce4-996d-b9696bd38568} <Trace> PushingToViewsBlockOutputStream: Pushing (sequentially) from db.table (15229c91-c202-4809-9522-9c91c2028809) to db.top_ids_mv (0cedb783-bf17-42eb-8ced-b783bf1742eb) took 0 ms.
One thing I noticed is that it takes 0 ms. I think that is wrong, because the query must take some time.
Thanks. Any help would be appreciated.

SummingMergeTree does not store rows whose summed metrics are all 0.
total UInt64 <----> count(*) AS count -- the names do not match. Your materialized view inserts 0 into total, and count goes nowhere.
Both behaviors are expected and were implemented deliberately:
https://den-crane.github.io/Everything_you_should_know_about_materialized_views_commented.pdf
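To see the first point in isolation, here is a tiny sketch you can run against a scratch table (the table t and its columns are made up for illustration); after the forced merge the key-1 rows sum to 0 and should be dropped:
CREATE TABLE t (k UInt8, v UInt64) ENGINE = SummingMergeTree ORDER BY k;
INSERT INTO t VALUES (1, 0), (1, 0), (2, 5);
OPTIMIZE TABLE t FINAL;   -- force a merge so rows with equal keys are summed
SELECT * FROM t;          -- expect only (2, 5) to remain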
...
SELECT
toDateTime((intDiv(toUInt32(date_time), 60 * 60) * 60) * 60) AS date_time,
id AS id,
count(*) AS total --<<<<<------
FROM
db.table
...
For query performance and better data compression I would use
ENGINE = SummingMergeTree
ORDER BY (id, date_time) -- id first, then time
Also try codecs:
`date_time` DateTime CODEC(Delta, LZ4),
`id` LowCardinality(String),
`total` UInt64 CODEC(T64, LZ4)
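Putting the three suggestions together (an alias that matches the target column, an id-first sort key, and the codecs), the view definition would look roughly like this. It is only a sketch, untested against your table; add POPULATE back if you recreate the view and want the history backfilled again:
CREATE MATERIALIZED VIEW db.top_ids_mv (
    `date_time` DateTime CODEC(Delta, LZ4),
    `id` LowCardinality(String),
    `total` UInt64 CODEC(T64, LZ4)
)
ENGINE = SummingMergeTree
ORDER BY (id, date_time)
SETTINGS index_granularity = 8192
AS
SELECT
    toDateTime((intDiv(toUInt32(date_time), 60 * 60) * 60) * 60) AS date_time,
    id,
    count(*) AS total   -- alias now matches the target column
FROM db.table
WHERE type = 'user'
GROUP BY date_time, id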

Related

Get random data from SQL Server without performance impact

I need to select random rows from my SQL table. When I searched for this in Google, the suggestion was to ORDER BY NEWID(), but that reduces performance. Since my table has more than 2,000,000 rows of data, this solution does not suit me.
I tried this code to get random data:
SELECT TOP 10 *
FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10
It also hurts performance sometimes.
Could you please suggest a good solution for getting random data from my table? I need a minimum number of rows, say 30 rows per request. I tried TABLESAMPLE to get the data, but it returns nothing once I add my WHERE condition, because it samples on a page basis rather than a row basis.
Try to calculate the random ids before filtering your big table.
Since your key is not an identity column, you need to number the records, and this will affect performance.
Note that I have used a DISTINCT clause to be sure to get different numbers.
EDIT: I have modified the query to use an arbitrary filter on your big table.
declare @n int = 30
;with
t as (
-- EXTRACT DATA AND NUMBER ROWS
select *, ROW_NUMBER() over (order by YourPrimaryKey) n
from YourBigTable t
-- SOME FILTER
WHERE 1=1 /* <-- PUT HERE YOUR COMPLEX FILTER LOGIC */
),
r as (
-- RANDOM NUMBERS BETWEEN 1 AND COUNT(*) OF FILTERED TABLE
select distinct top (@n) abs(CHECKSUM(NEWID()) % n)+1 rnd
from sysobjects s
cross join (SELECT MAX(n) n FROM t) t
)
select t.*
from t
join r on r.rnd = t.n
If your uniqueidentifier key is a random GUID (not generated with NEWSEQUENTIALID() or UuidCreateSequential), you can use the method below. This will use the clustered primary key index without sorting all rows.
SELECT t1.*
FROM (VALUES(
NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())) AS ThirtyKeys(ID)
CROSS APPLY(SELECT TOP (1) * FROM dbo.Table1 WHERE ID >= ThirtyKeys.ID) AS t1;

Ugly table query

I have inherited this table and am trying to optimize the queries. I am stuck on one query. Here is the table information:
RaterName - varchar(24) - name of the rater
TimeTaken - varchar(12) - is stored as 00:10:14:8
Year - char(4) - is stored as 2014
I need:
1. a distinct list of raters, with the total count, sum(TimeTaken) and avg(TimeTaken) for each rater (for a given year)
2. the sum(TimeTaken) and avg(TimeTaken) across all raters (for a given year)
Here is the query that I have come up with for #1. I would like the sum and avg to be formatted as hh:mm:ss. How can I do this?
SELECT
[RaterName]
, count(*) as TotalRatings
, SUM((DATEPART(hh,convert(datetime, timetaken, 101))*60)+DATEPART(mi,convert(datetime, timetaken, 101))+(DATEPART(ss,convert(datetime, timetaken, 101))/(60.0)))/60.0 as TotalTimeTaken
, AVG((DATEPART(hh,convert(datetime, timetaken, 101))*60)+DATEPART(mi,convert(datetime, timetaken, 101))+(DATEPART(ss,convert(datetime, timetaken, 101))/(60.0)))/60.0 as AverageTimeTaken
FROM
[dbo].[rating]
WHERE
year = '2014'
GROUP BY
RaterName
ORDER BY
RaterName
Output:
RaterName   TotalRatings   TotalTimeTaken   AverageTimeTaken
=============================================================
Rater1      257            21.113609        0.082154
Rater2      747            41.546106        0.055617
Rater3      767            59.257218        0.077258
Rater4      581            37.154163        0.063948
Can I incorporate #2 into this query, or should I write a second query and drop the GROUP BY from it?
On the front end, I am using C#.
WITH data ( raterName, timeTaken )
AS (
SELECT raterName,
DATEDIFF(MILLISECOND, CAST('00:00' AS TIME),
CAST(timeTaken AS TIME))
FROM rating
WHERE CAST([year] AS INT) = 2014
)
SELECT raterName, COUNT(*) AS totalRatings,
SUM(timeTaken) AS totalTimeTaken, avg(timeTaken) AS averageTimeTaken
FROM data
GROUP BY raterName
ORDER BY raterName;
PS: If you don't want milliseconds, you can make that SECOND or MINUTE.
EDIT: On your C# front end you can convert the milliseconds or seconds to a TimeSpan, which gives you the format when you use ToString(), i.e.:
var ttt = TimeSpan.FromSeconds(totalTimeTaken).ToString();
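If you would rather build the hh:mm:ss strings and the across-all-raters totals (your #2) on the SQL side instead of in C#, something like the sketch below should work; it assumes SQL Server 2012+ for CONCAT, reuses the CAST(timeTaken AS TIME) from above, and the ROLLUP row with a NULL raterName is the grand total:
WITH data ( raterName, secondsTaken )
AS (
SELECT raterName,
       DATEDIFF(SECOND, CAST('00:00' AS TIME), CAST(timeTaken AS TIME))
FROM rating
WHERE CAST([year] AS INT) = 2014
)
SELECT raterName, COUNT(*) AS totalRatings,
       -- total hours can exceed 24, so build the string by hand from seconds
       CONCAT(SUM(secondsTaken) / 3600, ':',
              RIGHT('0' + CAST((SUM(secondsTaken) % 3600) / 60 AS varchar(2)), 2), ':',
              RIGHT('0' + CAST(SUM(secondsTaken) % 60 AS varchar(2)), 2)) AS totalTimeTaken,
       -- AVG over ints truncates to whole seconds, which is fine for display
       CONCAT(AVG(secondsTaken) / 3600, ':',
              RIGHT('0' + CAST((AVG(secondsTaken) % 3600) / 60 AS varchar(2)), 2), ':',
              RIGHT('0' + CAST(AVG(secondsTaken) % 60 AS varchar(2)), 2)) AS averageTimeTaken
FROM data
GROUP BY ROLLUP(raterName)   -- the extra row with raterName = NULL covers all raters
ORDER BY raterName;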

sql cross table calculations

Hi, I need to write a query that does multiple things. I have it returning the details of orders within a certain time frame for customers aged between 20 and 30, but I also need to check whether the order's products cost more than a set amount.
However, that data is spread across multiple tables.
One table has the orderid, the prodcode and the quantity; another has the product information such as code and price; and the third piece comes from yet another table.
So I need to access the price of the product via the prodcode and quantity, do a cross-table calculation, and check whether the total is above 100, and I am trying to do this with an AND in the WHERE clause.
So I have 3 tables:
Orderplaced table with oid, odate, custno, paid
ordered table with oid, itemid, quant
items table with itemid, itemname, price
and I need to do a calculation across those tables in my query:
SELECT DISTINCT Orderplaced.OID, Orderplaced.odate, Orderplaced.custno, Orderplaced.paid
FROM Cust, Orderplaced, items, Ordered
WHERE Orderplaced.odate BETWEEN '01-JUL-14' AND '31-DEC-14'
AND Floor((sysdate-Cust.DOB) / 365.25) Between '20' AND '30'
AND Cust.SEX='M'
AND items.itemid=ordered.itemid
AND $sum(ordered.quan*item.PRICE) >100;
No matter what way I try to get the calculation to work, it always returns the same result, even on orders under 100 dollars.
Any advice on this would be good; it's for my studies but is troubling me a lot.
I think this is what you want. (I am not familiar with $sum; I've replaced it with SUM().)
SELECT
Orderplaced.OID,
Orderplaced.odate,
Orderplaced.custno,
Orderplaced.paid,
sum(ordered.quant * items.PRICE)
FROM
Cust
JOIN Orderplaced ON Cust.CustNo = Orderplaced.custno
JOIN Ordered ON Ordered.Oid = Orderplaced.Oid
JOIN items ON items.itemid = ordered.itemid
WHERE
Orderplaced.odate BETWEEN DATE '2014-07-01' AND DATE '2014-12-31'
AND Floor((sysdate-Cust.DOB) / 365.25) Between 20 AND 30
AND Cust.SEX = 'M'
GROUP BY
Orderplaced.OID,
Orderplaced.odate,
Orderplaced.custno,
Orderplaced.paid
HAVING
sum(ordered.quant * items.PRICE) > 100;
I think you want to try something like this...
SELECT DISTINCT Orderplaced.OID, Orderplaced.odate, Orderplaced.custno, Orderplaced.paid
FROM Cust
JOIN Orderplaced ON
Cust.<SomeId> = Orderplaced.<CustId>
AND Orderplaced.odate BETWEEN '01-JUL-14' AND '31-DEC-14'
WHERE Floor((sysdate-Cust.DOB) / 365.25) Between 20 AND 30
AND Cust.SEX='M'
AND (
SELECT SUM(Ordered.quan*Item.PRICE)
FROM Ordered
JOIN Item ON Item.ItemId = Ordered.ItemId
WHERE Ordered.<SomeId> = OrderPlaced.<SomeId>) > 100
A couple of pointers:
1. FLOOR returns a number; you are comparing it to a string.
2. Typically, when referencing a table in a query, the table has to be joined on its key columns; in your query you reference items and ordered without joining either of them on any key columns.
Hope that helps.

Add table data to itself, incrementing timestamp

I have a table with dummy data in it, with 40,000 rows, and a timestamp on each row that increments by a few milliseconds. I want to multiply these rows by, say, 10, with each block of 40,000 rows incremented by a day, an hour, or whatever I set it to be.
Is there a way to select data from a table and then feed it back into itself with one column changed slightly?
FWIW, there are 33 columns on this table.
Any help is appreciated!
The mysql code from gustavotkg is along the right lines.
INSERT INTO mytable (event_ts, col1, col2)
SELECT event_ts + interval '1 day', col1, col2
FROM mytable
WHERE event_ts BETWEEN <something> AND <something else>
Repeat with different intervals for multiple copies.
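If you want all ten copies in one statement instead of repeating it, a cross join against generate_series can supply the per-copy offset. This is only a PostgreSQL sketch, reusing the same illustrative mytable/event_ts/col1/col2 names and placeholders from above:
INSERT INTO mytable (event_ts, col1, col2)
SELECT event_ts + g.n * interval '1 day', col1, col2   -- copy n is shifted by n days
FROM mytable
CROSS JOIN generate_series(1, 10) AS g(n)
WHERE event_ts BETWEEN <something> AND <something else>;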
It is unclear whether you want to just update the rows or also select them at the same time, or even insert new rows. Updating is very simple:
UPDATE tbl
SET col1 = col1*10
,ts = ts + interval '1 day'
To also return all rows like a SELECT statement would (the updated state!):
UPDATE tbl
SET col1 = col1*10
, ts = ts + interval '1 day'
RETURNING *
If you actually want to INSERT new rows with just one column changed and the timestamp changed, and the point is to avoid having to type out all 33 columns, you could:
CREATE TEMP TABLE tbl_tmp AS SELECT * FROM tbl;
UPDATE tbl_tmp SET col1 = col1*10, ts = ts + interval '1 day';
INSERT INTO tbl SELECT * FROM tbl_tmp;
DROP TABLE tbl_tmp;
OR somewhat faster with the new writable CTEs in version 9.1:
BEGIN;
CREATE TEMP TABLE tbl_tmp ON COMMIT DROP AS SELECT * FROM tbl;
WITH x AS (
   UPDATE tbl_tmp SET col1 = col1*10, ts = ts + interval '1 day'
   RETURNING *
)
INSERT INTO tbl SELECT * FROM x;
COMMIT;  -- ON COMMIT DROP removes tbl_tmp automatically
Be sure to have autovacuum running or run VACUUM ANALYZE manually afterwards.

Getting a Subset of Records along with Total Record Count

I'm working on returning a recordset from SQL Server 2008 to do some pagination. I'm only returning 15 records at a time, but I need to have the total number of matches along with the subset of records. I've used two different queries with mixed results depending on where in the larger group I need to pull the subset. Here's a sample:
SET NOCOUNT ON;
WITH tempTable AS (
SELECT
FirstName
, LastName
, ROW_NUMBER() OVER(ORDER BY FirstName ASC) AS RowNumber
FROM People
WHERE
Active = 1
)
SELECT
tempTable.*
, (SELECT Max(RowNumber) FROM tempTable) AS Records
FROM tempTable
WHERE
RowNumber >= 1
AND RowNumber <= 15
ORDER BY
FirstName
This query works really fast when I'm returning items on the low end of matches, like records 1 through 15. However, when I start returning records 1000 - 1015, the processing goes from under a second to more than 15 seconds.
So I changed the query to the following instead:
SET NOCOUNT ON;
WITH tempTable AS (
SELECT * FROM (
SELECT
FirstName
, LastName
, ROW_NUMBER() OVER(ORDER BY FirstName ASC) AS RowNumber
, COUNT(*) OVER(PARTITION BY NULL) AS Records
FROM People
WHERE
Active = 1
) derived
WHERE RowNumber >= 1 AND RowNumber <= 15
)
SELECT
tempTable.*
FROM tempTable
ORDER BY
FirstName
That query runs the high-row-number requests in 2-3 seconds, but it runs the low-row-number queries in 2-3 seconds as well. Because it computes the count over all 70,000+ rows, every request takes longer, not just the ones for large row numbers.
So I need to figure out how to get a good row count, as well as return only a subset of items from any point in the resultset, without suffering such a huge penalty. I could handle a 2-3 second penalty for the high row numbers, but 15 seconds is too much, and I'm not willing to suffer slow loads on the first few pages a person views.
NOTE: I know that I don't need the CTE in the second example, but this is just a simple example. In production I'm doing further joins on the tempTable after I've filtered it down to the 15 rows I need.
Here is what I have done (and it's just as fast, no matter which records I return):
--Parameters include:
@pageNum int = 1,
@pageSize int = 0,
DECLARE
@pageStart int,
@pageEnd int
SELECT
@pageStart = @pageSize * @pageNum - (@pageSize - 1),
@pageEnd = @pageSize * @pageNum;
SET NOCOUNT ON;
WITH tempTable AS (
SELECT
ROW_NUMBER() OVER (ORDER BY FirstName ASC) AS RowNumber,
FirstName
, LastName
FROM People
WHERE Active = 1
)
SELECT
(SELECT COUNT(*) FROM tempTable) AS TotalRows,
*
FROM tempTable
WHERE @pageEnd = 0
OR RowNumber BETWEEN @pageStart AND @pageEnd
ORDER BY RowNumber
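For example, page 3 with a page size of 15 resolves to rows 31 through 45, while a page size of 0 falls through to returning everything:
DECLARE @pageNum int = 3, @pageSize int = 15;
SELECT @pageSize * @pageNum - (@pageSize - 1) AS pageStart,  -- 31
       @pageSize * @pageNum AS pageEnd;                      -- 45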
I've handled a situation a bit similar to this in the past by not bothering to determine a definite row count, but using the query plan to give me an estimated row count, a bit like the first item in this link describes:
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=108658
The intention was then to deliver whatever rows had been asked for within the range (say 900-915) and then return the estimated row count, like
rows 900-915 of approx. 990
which avoided having to count all rows. Once the user moves beyond that point, I just showed
rows 1000-1015 of approx. 1015
i.e. just taking the last requested row as my new estimate.
