I am new to Flink. I followed the quickstart on the Flink website and deployed Flink on a single machine. After I run "./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000" and enter the words as the website describes, I get the result shown in the attached screenshot ("final result"): the counts for a repeated word show up split across several lines instead of as a single total.
It seems that the program didn't do the reduce, and I just want to know why. Thanks.
The program did do a reduce, but not fully, because your input must have fallen into two different 5 second windows. That's why the 4 instances of ipsum were reported as 1 + 3 -- the first one fell into one window, and the other 3 into another window (along with the "bye").
Flink's window boundaries are based on alignment with the clock. So if your input events occurred between 14:00:04 and 14:00:08, for example, they would fall into two 5 second windows -- one for 14:00:00 - 14:00:04.999 and another for 14:00:05 - 14:00:09.999 -- even though all of your events fit into a single interval that's only 4 seconds long.
If you try again, you can expect to see similar, but probably not identical, results. That's a consequence of doing windowed analytics based on "processing time". If you want your applications to produce repeatable results, you should plan on using "event time" analytics instead (where the timestamps are based on when the events occurred, rather than on when they were processed).
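To make that concrete, below is a minimal sketch of what an event-time version of such a windowed word count could look like. This is not the shipped SocketWindowWordCount example: the Tuple3 element type of (word, count, timestampMillis) and the hard-coded sample records are assumptions made purely for illustration.

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWordCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input: (word, count, eventTimestampMillis) records.
        DataStream<Tuple3<String, Integer, Long>> words = env.fromElements(
                Tuple3.of("ipsum", 1, 1_000L),
                Tuple3.of("ipsum", 1, 2_000L),
                Tuple3.of("ipsum", 1, 3_000L),
                Tuple3.of("bye", 1, 6_000L));

        words
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(1))
                    .withTimestampAssigner((event, recordTs) -> event.f2))
            .keyBy(event -> event.f0)
            // Window membership is decided by the timestamps carried in the data,
            // so rerunning the job over the same input gives the same counts.
            .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            .reduce((a, b) -> Tuple3.of(a.f0, a.f1 + b.f1, b.f2))
            .print();

        env.execute("Event-time word count sketch");
    }
}

Here all three "ipsum" records land in the window [0, 5000) and are reduced to a count of 3, no matter when the job happens to run.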
I have a use case where I have 2 input topics in Kafka.
Topic schema:
eventName, ingestion_time (will be used as the watermark), orderType, orderCountry
Data for first topic:
{"eventName": "orderCreated", "userId":123, "ingestionTime": "1665042169543", "orderType":"ecommerce","orderCountry": "UK"}
Data for second topic:
{"eventName": "orderSucess", "userId":123, "ingestionTime": "1665042189543", "orderType":"ecommerce","orderCountry": "USA"}
I want to get all the userIds per orderType and orderCountry where the user performs the first event but not the second one within a window of 5 minutes, for a maximum of 2 events per user for an orderType and orderCountry (i.e. up to 10 minutes only).
I have unioned the data from both topics, created a view on top of it, and am trying to use Flink CEP SQL to get my output, but somehow I am not able to figure it out.
SELECT *
FROM union_event_table
MATCH_RECOGNIZE(
PARTITION BY orderType,orderCountry
ORDER BY ingestion_time
MEASURES
A.userId AS userId,
A.orderType AS orderType,
A.orderCountry AS orderCountry
ONE ROW PER MATCH
PATTERN (A not followed B) WITHIN INTERVAL '5' MINUTES
DEFINE
A As A.eventName = 'orderCreated'
B AS B.eventName = 'orderSucess'
)
The first thing is that I am not able to figure out what to use in place of "A not followed B" in SQL. The other thing is how I can restrict the output for a userId to a maximum of 2 events per orderType and orderCountry, i.e. if a user doesn't perform the 2nd event after the 1st event in 2 consecutive 5-minute windows, the state of that user should be removed, so that I will not get output for that user for the same orderType and orderCountry again.
I don't believe this is possible using MATCH_RECOGNIZE. This could, however, be implemented with the DataStream CEP library by using its capability to send timed out patterns to a side output.
This could also be solved at a lower level by using a KeyedProcessFunction. The long ride alerts exercise from the Apache Flink Training repo is an example of that -- you can jump straight to the solution if you want.
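For reference, here is a rough sketch of that lower-level approach (not the training solution itself). It assumes the unioned stream has been keyed by a composite of userId, orderType and orderCountry, and that there is an OrderEvent POJO with eventName and ingestionTime fields; those names are assumptions, not something defined in the question.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits the orderCreated event if no orderSucess arrives for the same key
// within 5 minutes of event time. OrderEvent is an assumed POJO.
public class MissingSuccessFunction
        extends KeyedProcessFunction<String, OrderEvent, OrderEvent> {

    private transient ValueState<OrderEvent> pendingCreate;

    @Override
    public void open(Configuration parameters) {
        pendingCreate = getRuntimeContext().getState(
                new ValueStateDescriptor<>("pendingCreate", OrderEvent.class));
    }

    @Override
    public void processElement(OrderEvent event, Context ctx, Collector<OrderEvent> out)
            throws Exception {
        if ("orderCreated".equals(event.eventName)) {
            pendingCreate.update(event);
            // fire 5 minutes (of event time) after the create
            ctx.timerService().registerEventTimeTimer(event.ingestionTime + 5 * 60 * 1000L);
        } else if ("orderSucess".equals(event.eventName)) {
            // matched in time: clear the state so the timer finds nothing to report
            pendingCreate.clear();
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<OrderEvent> out)
            throws Exception {
        OrderEvent created = pendingCreate.value();
        if (created != null) {
            out.collect(created);   // orderCreated with no orderSucess within 5 minutes
            pendingCreate.clear();  // drop the state so this key is not reported again
        }
    }
}

The extra requirement of at most 2 alerts per user per orderType and orderCountry could be handled by keeping a small counter in another piece of keyed state and clearing everything once it reaches 2.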
I am new to Apache Flink and am trying to understand how the concepts of event time and windowing are handled by Flink.
So here's my scenario :
I have a program that runs as a thread and creates a file with 3 fields every second, of which the 3rd field is the timestamp. There is a little tweak though: every 5 seconds I enter an older timestamp (t-5, you could say) into the new file created.
Now I run the stream processing job, which reads the 3 fields above into a tuple.
Now I have defined the following code for watermarking and timestamp generation:
WatermarkStrategy
.<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(4))
.withTimestampAssigner((event, timestamp) -> event.f2);
And then I use the following code to window the above and try to get the aggregation:
withTimestampsAndWatermarks
.keyBy(0)
.window(TumblingEventTimeWindows.of(Time.milliseconds(4000)))
.reduce((x, y) -> new Tuple3<String, Integer, Long>(x.f0, x.f1 + y.f1, y.f2))
It is clear that I am trying to aggregate the numbers within each window (for a little more context, the values of the field (f1) that I am aggregating are all 1s).
Hence I have the following questions :
That is, the window is just 4 seconds wide, and every fifth entry has an older timestamp, so I am expecting the next window to have fewer counts. Is my understanding wrong here?
If my understanding is right, then I do not see any aggregation when running both programs in parallel. Is there something wrong with my code?
Another thing that is bothering me is: what fields or parameters do the windows' start and end times really depend on? Is it the timestamp extracted from the events, or is it the processing time?
You have to configure the allowed lateness: https://nightlies.apache.org/flink/flink-docs-release-1.2/dev/windows.html#allowed-lateness. If it is not configured, Flink will drop the late messages, so the next window will have fewer elements than the previous window.
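For example, allowed lateness could be added to the windowed reduce from your job like this (the 5 seconds here is an arbitrary value chosen for illustration, not a recommendation):

withTimestampsAndWatermarks
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(4000)))
    // keep each window's state for an extra 5 seconds of event time, so elements
    // arriving after the watermark has passed the end of the window still update
    // (and re-fire) that window instead of being dropped
    .allowedLateness(Time.seconds(5))
    .reduce((x, y) -> new Tuple3<String, Integer, Long>(x.f0, x.f1 + y.f1, y.f2))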
The start of the assigned window is computed by the following calculation:
return timestamp - (timestamp - offset + windowSize) % windowSize
In your case, offset is 0 (the default). For an event time window, the timestamp is the event time; for a processing time window, the timestamp is the processing time at the Flink operator. E.g. if windowSize=3 and timestamp=122, then the element will be assigned to the window [120, 123).
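As a quick sanity check, that calculation can be reproduced directly; the class and method names below are just for this demo:

public class WindowStartDemo {
    // mirrors the window-start calculation quoted above
    static long windowStart(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    public static void main(String[] args) {
        long start = windowStart(122, 0, 3);
        // prints [120, 123): the element with timestamp 122 belongs to that window
        System.out.println("[" + start + ", " + (start + 3) + ")");
    }
}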
I have a requirement where I need to generate a unique application number in a sequential format.
Sample application number: APP-Date-0001. The 0001 will keep increasing for the entire day, and the counter should be reset the next day, so for the next day it should again start from 0001 with the current date.
The problem will occur when 2 users are creating an application at the same time.
Keep the counter and the last date it was used in a custom setting or a similar object.
But access that custom setting with a normal SOQL query, not via the special custom setting methods like getInstance().
Finally, in that SOQL query use FOR UPDATE: https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql_select_for_update.htm
If 2 operations start at the same time, one will be held until the other one finishes or a timeout happens.
How can I use the ingestion time characteristic in Apache Flink? I know we need to set the environment time characteristic. But how can I collect the data with timestamps that can be treated as the ingestion time? Currently, when I use it, the window is processed based on the system clock time. I want the processing to be based on the time at which the data enters the Flink environment.
A little code extract which may help to make this clearer:
Time characteristic for the environment:
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
Window time:
keyedEvents.timeWindow(Time.minutes(5))
Collection in the source:
ctx.collect(monSourceData);
If the data collection starts at, let's say, 11:03, I want it to end at 11:08, i.e. after 5 minutes. But it stops at 11:05 (somehow behaving like processing time).
Thanks in advance for your help.
Tumbling and sliding windows in Flink are always aligned to the clock (either the event time clock defined by the events and watermarks, or the system clock); time windows are not aligned to the first event. So if you have windows that are 5 minutes long, there will be a window for the interval from 11:00 to 11:05, for example, regardless of the TimeCharacteristic.
Tumbling windows do, however, take an optional offset parameter that can be used to shift this alignment. So you could specify TumblingEventTimeWindows.of(Time.minutes(5), Time.minutes(3)), for example, to shift the intervals by 3 minutes.
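For example, applied to the keyedEvents stream from your job (using the explicit event-time window assigner, which is what timeWindow resolves to when the ingestion time characteristic is set), the difference looks like this; a window function such as your existing aggregation would follow in both cases:

// 5-minute windows aligned to the clock: 11:00-11:05, 11:05-11:10, ...
keyedEvents.window(TumblingEventTimeWindows.of(Time.minutes(5)));

// the same windows shifted by 3 minutes: 11:03-11:08, 11:08-11:13, ...
keyedEvents.window(TumblingEventTimeWindows.of(Time.minutes(5), Time.minutes(3)));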
I'm programming a site in PHP/MySQL that gets search results for products via an API from an external site. This site will also have its own products, and the owners of the site want the search results to be inter-connected.
If someone searches for VIDEO, ordered by date, then the results should all be in order regardless of the source they came from.
eg.
July 31 - Video A - our database
July 30 - Video B - via API
July 29 - Video C - via API
July 28 - Video D - our database
...
The problem I'm having is figuring out a way to do this effectively, especially regarding viewing multiple pages of results. If someone clicks to the 2nd page of results, then I need to figure out the last item on the first page of results (and the last item from the API), then get only the items from the API starting after the last API item viewed on the previous page, do the same for our database results, and re-combine them again.
To avoid this complex algorithm, another idea I had was to limit the results to a large amount, like 500 results, grab them all at once, and order them. Then if the user goes forward a few pages, I do not have to re-grab all the data.
Does anyone have suggestions on good algorithms to use to combine two sets of search results?
Whether you use it for caching or not, you will need to grab at least a page's worth of results from both sources, in case all of the results for the next page come from a single source.
Grabbing a lot of results and caching them (in the session) is one solution you could use.
If for some reason you don't want to cache all the results (if the operation is expensive and you need this optimized), you could store a simple array in the session that records which source each displayed result came from, so you know the starting offset in each source for the next page.
For example (pseudo code)
**Request 1**
Get 10 results from API
Get 10 results from Database
Merge the results
Display first 10 and save the order to an array
(A for API, D for Database, ex: A,A,A,D,A,D,D,A,D,A)
User clicks page 2
**Request 2** (Page 2)
Get 10 results from API starting at 7 (6 API results were shown on page 1)
Get 10 results from Database starting at 5 (4 database results were shown on page 1)
Repeat the merge and display steps above.
You could also optionally cache what you have retrieved so far (and you will have 10 extra results). This would make the first request longer, but could possibly make the second request much faster.
If the user jumps forward several pages, you would need to get the largest number of results that could have been displayed in the preceding unknown pages from each source.
If you are not too worried about performance from either source, I would retrieve up to a large number like you said and cache all the results temporarily. As soon as a new search is executed, dump the old results.