So I want to alert when my watermark falls behind.
I want to use the metrics reported by Flink's job manager. Something like this, but it doesn't work the way I'd like:
(timestamp(flink_taskmanager_job_task_operator_currentInputWatermark{task_name=~"my_window.*"})-(4*60*60*1000))-flink_taskmanager_job_task_operator_currentInputWatermark{task_name=~"my_window.*"}
In words: I'd like the difference between the current time (the time when the metric was reported) and the watermark timestamp.
(4*60*60*1000) is to convert to EDT -- is there a better way to do this?
OK, so the above query was almost perfect. What I was doing wrong was shifting a timestamp that was already in EDT by -4h. Below is the query that works:
timestamp(flink_taskmanager_job_task_operator_currentInputWatermark{task_name="my_window",job_name="session"})*1000-flink_taskmanager_job_task_operator_currentInputWatermark{task_name="my_window",job_name="session"}
The flink_taskmanager_job_task_operator_currentInputWatermark metric reports in milliseconds, but timestamp() returns seconds, hence the *1000 conversion.
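The unit handling can be checked outside Prometheus. Below is a minimal Python sketch of the same arithmetic, assuming the sample time is in epoch seconds (which is what PromQL's timestamp() returns) and the watermark is in epoch milliseconds. The function name and the sample values are made up for illustration.

```python
def watermark_lag_ms(sample_time_s: float, watermark_ms: int) -> float:
    """Return how far the watermark trails the sample time, in milliseconds."""
    # The sample time is in seconds, so scale it to milliseconds first.
    return sample_time_s * 1000 - watermark_ms


# Hypothetical values: the watermark is 60 seconds behind the sample time.
sample_time_s = 1_700_000_000.0
watermark_ms = 1_699_999_940_000
print(watermark_lag_ms(sample_time_s, watermark_ms))
```

Both inputs count from the Unix epoch, so the subtraction needs no time zone offset.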
I am using Flink 1.13 SQL. I have a Kafka table like
create table my_table (
id string,
event_time timestamp(3),
watermark for event_time as ...
)
I want to group messages every 10 minutes, like a tumble window. In addition, I want to recalculate the result when late messages arrive within 1 hour.
One way I know of is to use a UDF like
select count(1) from my_table
where event_time >= '1 hour ago'
group by ten_minutes_udf(event_time)
But this way the Flink state never expires, and I can't find a suitable Window TVF Aggregation to do it.
Is there another way to do this?
In Flink 1.14 a current_watermark() function was added that can be used to detect and operate on late events.
There has also been, since before 1.13, an experimental table.exec.emit.allow-lateness configuration option that can be used with the (now legacy) window operations, but not with window TVFs.
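The idea of letting late events into a 10-minute window for up to 1 hour can be illustrated in plain Python. This is not Flink code and not necessarily Flink's exact rule; the function names and the acceptance condition are assumptions chosen to show the concept.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
ALLOWED_LATENESS = timedelta(hours=1)  # assumed lateness bound from the question


def window_start(event_time: datetime) -> datetime:
    """Floor an event time to the start of its 10-minute tumbling window."""
    return event_time - timedelta(
        minutes=event_time.minute % 10,
        seconds=event_time.second,
        microseconds=event_time.microsecond,
    )


def accepts(event_time: datetime, watermark: datetime) -> bool:
    """True while the event's window is still open for (late) updates."""
    window_end = window_start(event_time) + WINDOW
    return watermark < window_end + ALLOWED_LATENESS
```

With this rule, state for a window can be dropped once the watermark passes the window end plus the allowed lateness, which is what keeps the state bounded.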
I'm simulating a streaming task using Flink DataStream and I want to execute an SQL query on each window.
Let's say this is the query
SELECT name, age, sum(days), avg(salary)
FROM employees
WHERE age > 25
GROUP BY name, age
ORDER BY name, age
I'm having a hard time translating it to Flink. From my understanding, to calculate the average I need to do it manually using .apply() and a WindowFunction. But how do I calculate the sum then? Also manually in the same WindowFunction?
I'm also wondering if it is possible to do order by on the whole window?
Below is the pseudocode of what I thought of so far. Any help would be appreciated! Thanks!
employeesStream
.filter(new FilterFunction() ....) // where clause
.keyBy(nameIndex, ageIndex) // group by??
.timeWindow(Time.seconds(10), Time.seconds(1))
.apply(new WindowFunction() ....) // calculate average (and sum?)
// order by??
I checked the Table API, but it seems that not many operations are supported for streaming, e.g. orderBy.
Ordering in streaming is not trivial. How do you want to sort something that is never ending? In your example you want to calculate an average or a sum, which is just one value per window. You cannot sort one value.
Another possibility is to buffer all values and wait for an indicator of completeness to start sorting. Thanks to event-time and watermarks, it is possible to sort a stream if you know that you have seen all values until a certain time (aka watermarks).
Event-time sort has been introduced recently and will be part of Flink 1.4 Table API. See here for an example.
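As a rough illustration of what the query asks for within a single window, here is a plain-Python sketch that computes the sum and the average in the same pass and sorts the per-group results once the window is complete. It is not the Flink API; the function name and the row layout (name, age, days, salary) are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_window(rows):
    """Aggregate one window's rows, given as (name, age, days, salary) tuples."""
    # Per-group accumulator: [sum of days, sum of salary, row count]
    acc = defaultdict(lambda: [0, 0.0, 0])
    for name, age, days, salary in rows:
        if age > 25:  # WHERE age > 25
            group = acc[(name, age)]  # GROUP BY name, age
            group[0] += days
            group[1] += salary
            group[2] += 1
    # ORDER BY name, age: possible because the window's contents are complete.
    return [
        (name, age, days_sum, salary_sum / count)
        for (name, age), (days_sum, salary_sum, count) in sorted(acc.items())
    ]


window_rows = [
    ("bob", 30, 2, 100.0),
    ("amy", 40, 1, 50.0),
    ("bob", 30, 3, 200.0),
    ("cat", 20, 5, 10.0),
]
print(aggregate_window(window_rows))
```

The average falls out of a running sum and a count, so both aggregates can share one accumulator.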
I'm converting something from SQL Server to PostgreSQL. There's a table with a calculated field between a BeginTime and an EndTime called MidTime. The times are offsets from the beginning of a video clip and will never be more than about 6 minutes long. In SQL Server, BeginTime, EndTime, and MidTime are all TimeSpans. You can use this as the function:
DATEADD(ms, DATEDIFF(ms,BeginTime, EndTime)/2, BeginTime)
This takes the difference between the two timespans in milliseconds, divides it by 2, and then adds it to the BeginTime. Super straightforward. The result looks like this:
ID BeginTime EndTime MidTime
10137 00:00:05.0000000 00:00:07.0000000 00:00:06.0000000
10138 00:00:08.5000000 00:00:09.6660000 00:00:09.0830000
10139 00:00:12.1660000 00:00:13.4000000 00:00:12.7830000
10140 00:00:14.6000000 00:00:15.7660000 00:00:15.1830000
10141 00:00:17.1330000 00:00:18.3000000 00:00:17.7160000
10142 00:00:19.3330000 00:00:21.5000000 00:00:20.4160000
10143 00:00:23.4000000 00:00:25.4000000 00:00:24.4000000
10144 00:00:25.4330000 00:00:26.8330000 00:00:26.1330000
I've looked at all of the different things available to me in PostgreSQL and don't see anything like this. I'm storing BeginTime and EndTime as "time without time zone" time(6) values, and they look right in the database. I can subtract these from each other, but I can't get the value in milliseconds to halve (division of times is not allowed) and then there's no obvious way to add the milliseconds back into the BeginTime.
I've looked at EXTRACT which when you ask for milliseconds gives you the value of second and milliseconds, but just that part of the time. I can't seem to get a representation of the time that I can subtract, divide, and then add the result back into another time.
I'm using Postgres 9.4 and I don't see any simple way of doing this without breaking the date into component parts and getting overall milliseconds (seems like it would work but I don't want to do such an ugly thing if I don't need to), or converting everything to a unix datetime and then doing the calculations and then it's not obvious how to get it back into a "time without time zone."
I'm hoping there's something elegant that I'm just missing? Or maybe a better way to store these where this work is easier? I am only interested in the time part, so time(6) seemed closest to SQL Server's TimeSpan.
Just subtract one from the other, divide the result by two, and add it to begintime:
begintime + (endtime - begintime)/2
It is correct that you can't divide a time value. But the result of endtime - begintime is not a time but an interval. And you can divide an interval by 2.
The above expression works with time, timestamp or interval columns.
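The same interval arithmetic can be mirrored in Python, where timedelta plays a role similar to PostgreSQL's interval. This is only a sketch to show why the expression works; the function name is made up, and the sample values are taken from row 10138 in the question.

```python
from datetime import timedelta


def midpoint(begin: timedelta, end: timedelta) -> timedelta:
    """Return the point halfway between two offsets."""
    # (end - begin) is a duration, and a duration can be divided by a number.
    return begin + (end - begin) / 2


begin = timedelta(seconds=8, milliseconds=500)  # 00:00:08.500
end = timedelta(seconds=9, milliseconds=666)    # 00:00:09.666
print(midpoint(begin, end))
```

The result is the offset 00:00:09.083, which matches the MidTime shown for that row.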
What am I doing wrong in this query?
SELECT * FROM TreatmentPlanDetails
WHERE
accountId = 'ag5zfmRvbW9kZW50d2ViMnIRCxIIQWNjb3VudHMYtcjdAQw' AND
status = 'done' AND
category = 'chirurgia orale' AND
setDoneCalendarEventStartTimestamp >= [timestamp for 6 june 2012] AND
setDoneCalendarEventStartTimestamp <= [timestamp for 11 june 2012] AND
deleteStatus = 'notDeleted'
ORDER BY setDoneCalendarEventStartTimestamp ASC
I am not getting any records, and I am sure there are records meeting the where clause conditions. To get the correct records I have to widen the timestamp interval by 1 millisecond. Is this normal? Furthermore, if I modify this query by removing the category filter, I get the correct results. This is definitely weird.
I also asked on Google Groups, but I got no answer. Anyway, for details:
https://groups.google.com/forum/?fromgroups#!searchin/google-appengine/query/google-appengine/ixPIvmhCS3g/d4OP91yTkrEJ
Let's talk specifically about creating timestamps to go into the query. What code are you using to create the timestamp record? Apparently that's important, because fuzzing with it a little bit affects the query. It may be relevant that in the datastore, timestamps are recorded as integers representing posix timestamps with microseconds, i.e. the number of microseconds since 1/1/1970 UTC (not counting leap seconds). It's also relevant that dates (i.e. without a time) are represented as midnight, i.e. the earliest time on that day. But please show us the exact code. (It may also be important to show the actual content of the record that you're attempting to retrieve.)
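To make the representation concrete, here is a small Python sketch that turns the two dates from the query into POSIX microseconds, with each bare date taken as midnight UTC. The helper name is hypothetical, and this is not the code the asker used.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_posix_micros(dt: datetime) -> int:
    """Convert an aware datetime to whole microseconds since the Unix epoch."""
    delta = dt - EPOCH
    # Integer arithmetic avoids floating-point rounding at microsecond scale.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


# A date without a time is treated as midnight, the earliest time on that day.
start = to_posix_micros(datetime(2012, 6, 6, tzinfo=timezone.utc))
end = to_posix_micros(datetime(2012, 6, 11, tzinfo=timezone.utc))
print(start, end)
```

With bounds built this way, an upper bound of "<= 11 june 2012" matches only values at or before midnight at the start of that day, so how the bounds are constructed matters.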
An aside that is not specific to your question: Entity property names count as part of your storage quota. If this is going to be a huge dataset, you might pay more $$ than you'd like for property names like setDoneCalendarEventStartTimestamp.
Because you write:
if I modify this query by removing the category filter, I am getting the correct results
this probably means that the category was not indexed at the time you wrote the matching records to the datastore. You have to rewrite your records to the datastore if you want them added to the newly created index.
I am looking for the HQL equivalent of subtracting x days from the current timestamp to get a queryable value.
So, something like this pseudo-HQL:
from Newspaper as newspaper
where newspaper.published < current_timestamp - days(:daysparam)
Here daysparam is injected as a query parameter, and published is a date field.
Is this in any way doable in HQL only, without writing your own Hibernate dialect or using criteria in actual code? It seems strange for such a standard feature not to be supported by plain HQL.
I am using Spring Batch's HibernatePagingItemReader, which is XML-only, so I wanted to avoid the yak shaving of extending that class or creating my own custom dialect etc.
Similar questions seem to suggest only calendar criteria or a new dialect:
Performing Date/Time Math In HQL?
How to perform date operations in hibernate HQL
Something doesn't look quite right: you said that published is a date field (in the database, I suppose). Then, you are using current_timestamp minus an integer value to compare with the date field. As a result, you are not getting the timestamp for the date in the parameter, you are just getting current_timestamp - 2, which I don't believe represents "two days ago" ;-) If you had used current_date, I guess it might work:
from Newspaper as newspaper where newspaper.published < current_date - :daysparam
But still, I'd prefer to leave this calculation to the Java side, so that the query would be:
from Newspaper as newspaper where newspaper.published < :start_date
This only fails to work if your servers are not using UTC (which they should be).
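Computing the cutoff in application code and binding it as :start_date could look like the sketch below. It is written in Python only to show the calculation; the answer suggests doing this on the Java side, and the function name and the optional now argument are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cutoff(days: int, now: Optional[datetime] = None) -> datetime:
    """Return the instant `days` days before `now` (default: current UTC time)."""
    if now is None:
        # Using UTC explicitly keeps the result independent of the server's zone.
        now = datetime.now(timezone.utc)
    return now - timedelta(days=days)


# The returned value would be bound to the :start_date query parameter.
print(cutoff(2, datetime(2024, 1, 10, tzinfo=timezone.utc)))
```

Passing now explicitly makes the calculation easy to test, while the query itself stays a plain comparison against one parameter.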