Apache Flink complex analytics stream design & challenges

Problem statement:
Trying to evaluate Apache Flink for modelling advanced real-time, low-latency distributed analytics.
Use case abstract:
Provide complex analytics for instruments I1, I2, I3, ..., each having a product definition P1, P2, P3, configured with dynamic user parameters U1, U2, U3, and requiring streaming market data M1, M2, M3, ...
The instrument analytics functions (A1, A2) are computationally complex; some of them could take 300-400 ms, but they can be computed in parallel.
From the above, the market data stream is clearly much faster (<1 ms per event) than the analytics functions, and the calculations need to consume the latest consistent market data.
The next challenge is multiple dependent enrichment functions E1, E2, E3 (e.g. Risk/PnL) which combine streaming market data with instrument analytics results (e.g. price or yield).
The last challenge is consistency of the calculations: function A1 could be faster than A2, yet a consistent all-instrument result is needed for a given market input.
Calculation graph dependency examples (scale this to hundreds of instruments & 10-15 market data sources); the dependency flow looks like:
- M1 + M2 + P1 => A2
- M1 + P1 => A1
- A1 + A2 => E2
- A1 => E1
- E1 + E2 => Result numbers
Questions:
What is the correct design/model for these calculation data streams? Currently I use ConnectedStreams for (P1 + M1); another approach could be an iterative model feeding the same instrument static data back to itself.
I am facing an issue using just the latest market data events in calculations, as the analytics function (A1) is a lot slower than the market data (M1) stream.
Hence I need stale market data eviction for the next iteration, while retaining entries for which no newer value is available (like an LRU cache).
I need to synchronize/correlate the execution of functions with different time complexity, so that iteration 2 starts only when everything in iteration 1 has finished.

This is quite a broad question, and to answer it more precisely one would need a few more details.
Below are a few thoughts that I hope will point you in a good direction and help you approach your use case:
Connected streams by key (a.keyBy(...).connect(b.keyBy(...))) are the most powerful join- or union-like primitive. Using a CoProcessFunction on a connected stream should give you the flexibility to correlate or join values as needed. You can, for example, store the events from one stream in state while waiting for a matching event to arrive from the other stream.
Always holding the latest data of one input is easily done by putting that value into the state of a CoFlatMapFunction or a CoProcessFunction: for each event from input 1, you store the event in the state; for each event from stream 2, you look into the state to find the latest event from stream 1.
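For illustration, here is a minimal sketch of that pattern. MarketData, Instrument, Result, instrumentId, and Analytics.a1 are hypothetical names standing in for your M1/P1-style types, not anything Flink provides:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public class LatestMarketDataJoin
        extends CoProcessFunction<MarketData, Instrument, Result> {

    // Latest market data per key (e.g. per instrument id).
    private transient ValueState<MarketData> latestMarket;

    @Override
    public void open(Configuration parameters) {
        latestMarket = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latest-market", MarketData.class));
    }

    @Override
    public void processElement1(MarketData m, Context ctx, Collector<Result> out)
            throws Exception {
        // Fast stream: just remember the most recent value.
        latestMarket.update(m);
    }

    @Override
    public void processElement2(Instrument p, Context ctx, Collector<Result> out)
            throws Exception {
        // Slow stream: evaluate against whatever market data is current.
        MarketData m = latestMarket.value();
        if (m != null) {
            out.collect(Analytics.a1(p, m)); // hypothetical analytics function
        }
    }
}

// Wiring:
// marketData.keyBy(m -> m.instrumentId)
//           .connect(instruments.keyBy(p -> p.instrumentId))
//           .process(new LatestMarketDataJoin());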
To synchronize on time, you could actually look into using event time. Event time can also be "logical time", meaning just a version number, an iteration number, or anything else; you only need to make sure that the timestamps you assign and the watermarks you generate reflect that consistently.
If you then window by event time, you will get all data of one version together, regardless of whether one operator is faster than others or the events arrive via paths with different latency. That is the beauty of real event-time processing :-)
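To make the "logical time" idea concrete, here is a sketch under the assumption that each event carries an iteration/version number and that versions arrive in order; VersionedEvent and getVersion() are made-up names:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// given: DataStream<VersionedEvent> events
DataStream<VersionedEvent> versioned = events.assignTimestampsAndWatermarks(
    new AssignerWithPunctuatedWatermarks<VersionedEvent>() {
        @Override
        public long extractTimestamp(VersionedEvent e, long previousTimestamp) {
            return e.getVersion(); // the version number acts as event time
        }

        @Override
        public Watermark checkAndGetNextWatermark(VersionedEvent e, long ts) {
            // Once version ts shows up, everything up to ts - 1 is complete.
            return new Watermark(ts - 1);
        }
    });

// Windowing by this "time" (e.g. tumbling windows of size 1) then groups
// all events of the same version, however long each one took to compute.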

Related

Does Flink's windowing operation process elements at the end of the window, or does it do rolling processing?

I am having some trouble understanding the way windowing is implemented internally in Flink and could not find any article which explains this in depth. In my mind, there are two ways this can be done. Consider the simple windowed word count below:
env.socketTextStream("localhost", 9999)
    .flatMap(new Splitter())
    .keyBy(0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(500)))
    .sum(1);
Method 1: Store all events for 500 seconds and at the end of the window, process all of them by applying the sum operation on the stored events.
Method 2: We use a counter to store a rolling sum for every window. As each event in a window comes in, we do not store the individual events but keep adding 1 to the previously stored counter, and output the result at the end of the window.
Could someone kindly help me understand which of the above methods (or maybe a different approach) Flink actually uses? There are pros and cons to both approaches, and understanding this is important in order to configure the resources for the cluster correctly.
E.g., Method 1 seems very close to batch processing and might suffer from a processing spike at every 500-second interval while sitting idle otherwise, whereas Method 2 would need to maintain a common counter across all task managers.
sum is a reducing function, as mentioned in the ReduceFunction section of the docs (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#reducefunction). Internally, Flink applies the reduce function to each input element as it arrives and simply keeps the reduced result in a ReducingState.
For other window functions, like window.apply(WindowFunction), there is no incremental aggregation, so all input elements are saved in ListState.
The window-functions section of the same document (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#window-functions) describes how the internal elements are handled in Flink.
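As a minimal sketch of the contrast (MyWindowFunction is a hypothetical WindowFunction; the window assigner is named per the current API):

import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Incremental aggregation: per key and window, only the running sum lives in
// state (close to "Method 2", but scoped per key/window, not a global counter).
input.keyBy(0)
     .window(TumblingProcessingTimeWindows.of(Time.seconds(500)))
     .sum(1);

// Full-window function: every element is buffered in ListState until the
// window fires, then handed to the function in one go ("Method 1").
input.keyBy(0)
     .window(TumblingProcessingTimeWindows.of(Time.seconds(500)))
     .apply(new MyWindowFunction());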

Tree evaluation in Flink

I have a use case where I want to build a real-time decision tree evaluator using Flink. I have a decision tree like the one below:
Decision tree example:
Root node (Product A): check if the price of Product A increased by $10 in the last 10 minutes.
If yes ---> left child of A (Product B): check if the price of Product B increased by $20 in the last 10 minutes ---> if not, output Product B.
If no ---> right child of A (Product C): check if the price of Product C increased by $20 in the last 10 minutes ---> if not, output Product C.
Note: This is just an example of one decision tree; I have multiple such decision trees with different product types, numbers of nodes, and conditions, and I want to write a common Flink app to evaluate all of them.
As input I am getting a data stream with prices of all product types (A, B, and C) every minute. To achieve my use case, one approach I can think of is as follows:
Filter the input stream by product type.
For each product type, use a sliding window over the last X minutes (based on the product type), triggered every minute.
Use a process window function to compute the price difference for that product type and emit it on an output stream (see the sketch after these steps).
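A rough sketch of these three steps, assuming a 10-minute look-back; PriceEvent, productType, price, timestamp, and PriceDiff are hypothetical names:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// given: DataStream<PriceEvent> prices (one price per product type per minute)
DataStream<PriceDiff> diffs = prices
    .keyBy(e -> e.productType)                                 // step 1
    .window(SlidingProcessingTimeWindows.of(Time.minutes(10),  // step 2
                                            Time.minutes(1)))
    .process(new ProcessWindowFunction<PriceEvent, PriceDiff, String, TimeWindow>() {
        @Override
        public void process(String productType, Context ctx,
                            Iterable<PriceEvent> events, Collector<PriceDiff> out) {
            // step 3: compare oldest and newest price in the window
            PriceEvent first = null, last = null;
            for (PriceEvent e : events) {
                if (first == null || e.timestamp < first.timestamp) first = e;
                if (last == null || e.timestamp > last.timestamp) last = e;
            }
            out.collect(new PriceDiff(productType, last.price - first.price));
        }
    });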
Now that we have the price difference of each product type/node of the tree, we can evaluate the decision tree logic. To do this, we have to make sure that the price-diff calculations for all product types in a decision tree (Products A, B, and C in the example above) have completed before determining the output. One way is to store the outputs of all these products in a datastore and keep checking from an EC2 instance every 5 s or so whether all the price computations are complete; once they are, execute the decision tree logic to determine the output product.
I wanted to understand whether there is any other way this entire computation can be done in Flink itself, without needing any other components (datastore/EC2). I am fairly new to Flink, so any leads would be highly appreciated!

Flink Windows Boundaries, Watermark, Event Timestamp & Processing Time

Problem Definition & Establishing Concepts
Let's say we have a TumblingEventTimeWindow with a size of 5 minutes, and we have events containing 2 basic pieces of information:
number
event timestamp
In this example, we kick off our Flink topology at 12:00 PM worker-machine wall-clock time (of course, workers can have out-of-sync clocks, but that's out of the scope of this question). This topology contains one processing operator whose responsibility is to sum up the values of events belonging to each window, plus a Kafka sink which is irrelevant to this question.
This window has a BoundedOutOfOrdernessTimestampExtractor with an allowed latency of one minute.
Watermark: To my understanding, watermark in Flink and Spark Structured Stream is defined as (max-event-timestamp-seen-so-far - allowed-lateness). Any event whose event timestamp is less than or equal to this watermark will be discarded and ignored in result computations.
Part 1 (Determining Boundaries Of The Window)
Happy (Real-Time) Path
In this scenario several events arrive at the Flink operator with different event timestamps spanning 12:01 - 12:09, and the event timestamps are relatively aligned with our processing time. Since we're dealing with the EVENT_TIME characteristic, whether or not an event belongs to a particular window should be determined via its event timestamp.
Old Data Rushing In
In that flow I have assumed the boundaries of the two tumbling windows are 12:00 -- 12:05 and 12:05 -- 12:10, just because we kicked off the execution of the topology at 12:00. If that assumption is correct (I hope not), then what happens in a back-filling situation in which several old events come in with much older event timestamps, and we have again kicked off the topology at 12:00 (old enough that our lateness allowance does not cover them)?
If it goes like that, then our events won't be captured in any window, of course; so again, I'm hoping that's not the behavior :)
The other option would be to determine the windows' boundaries via the event timestamps of the arriving events. If that's the case, how would that work? Does the smallest event timestamp noticed become the beginning of the first window, with the subsequent boundaries determined from there based on the size (in this case 5 minutes)? That approach would have flaws and loopholes too. Can you please explain how this works and how the boundaries of windows are determined?
Backfilling Events Rushing In
The answer to the previous question will address this as well, but I think it would be helpful to mention it explicitly here. Let's say I have this TumblingEventTimeWindow of size 5 minutes. Then at 12:00 I kick off a backfilling job which rushes many events to the Flink operator whose timestamps cover the range 10:02 - 10:59; since this is a backfilling job, the whole execution takes about 3 minutes to finish.
Will the job allocate 12 separate windows and populate them correctly based on the events' event timestamps? What would be the boundaries of those 12 windows? And will I end up with 12 output events, each having the summed-up value of its allocated window?
Part 2 (Unit/Integration Testing Of Such Stateful Operators)
I also have some concerns regarding automated testing of such logic and operators: what is the best way to manipulate processing time and to trigger the behaviors that shape the desired windows' boundaries for testing purposes? Especially since what I've read so far on leveraging test harnesses seems a bit confusing and can potentially lead to cluttered code which is not that easy to read:
Unit Test Stateful Operators
Lateness Testing of Window in Flink
References
Most of what I've learned in this area and the source of some of my confusion can be found in the following places:
Timestamp Extractors & Watermark Emitters
Event Time Processing & Watermarking
Handling Late Data & Watermarking in Spark
The images in that section of the Spark doc were super helpful and educative, but at the same time, the way the windows' boundaries are aligned with processing times rather than event timestamps caused some confusion for me.
Also, in that visualization, it seems like the watermark is computed once every 5 minutes, since that's the sliding specification of the window. Is that the determining factor for how often the watermark should be computed? How does this work in Flink with regard to different windows (e.g. tumbling, sliding, session, and more)?!
HUGE thanks in advance for your help and if you know about any better references with regard to these concepts and their internals working, please let me know.
UPDATES AFTER @snntrable's Answer Below
If you run a Job with event time semantics, the processing time at the window operators is completely irrelevant
That is correct, and I understand that part. Once you're dealing with the EVENT_TIME characteristic, you're pretty much divorced from processing time in your semantics/logic. The reason I brought up processing time was my confusion with regard to the following key question, which is still a mystery to me:
How are the windows' boundaries computed?!
Also, thanks a lot for clarifying the distinction between out-of-orderness and lateness. The code I was dealing with totally threw me off with a misnomer (the constructor argument of a class inheriting from BoundedOutOfOrdernessTimestampExtractor was named maxLatency) :/
With that in mind, let me see if I have this correct with regard to how the watermark is computed and when an event will be discarded (or side-outputted):
Out of Orderness Assigner
current-watermark = max-event-time-seen-so-far - max-out-of-orderness-allowed
Allowed Lateness
current-watermark = max-event-time-seen-so-far - allowed-lateness
Regular Flow
current-watermark = max-event-time-seen-so-far
And in any of these cases, any event whose event timestamp is less than or equal to the current watermark will be discarded (side-outputted), correct?!
And this brings up a new question: when would you want to use out-of-orderness as opposed to lateness, given that the current-watermark computation can (mathematically) be identical in these cases? And what happens when you use both (does that even make sense)?!
Back To Windows' Boundaries
This is still the main mystery to me. Given all the discussion above, let's revisit the concrete example I provided and see how the windows' boundaries are determined here. Let's say we have the following scenario (events are in the shape of (value, timestamp)):
Operator kicked off at 12:00 PM (that's the processing time)
Events arriving at the operator in the following order
(1, 8:29)
(5, 8:26)
(3, 9:48)
(7, 9:46)
We have a TumblingEventTimeWindow with size 5 minutes
The window is applied to a DataStream with a BoundedOutOfOrdernessTimestampExtractor which has a 2-minute maxOutOfOrderness
Also, the window is configured with an allowedLateness of 1 minute
NOTE: If you cannot have both out-of-orderness and lateness, or if that does not make sense, please only consider the out-of-orderness in the example above.
Finally, can you please layout the windows which will have some events allocated to them and, please specify the boundaries of those windows (beginning and end timestamps of the window). I'm assuming the boundaries are determined by events' timestamps as well but it's a bit tricky to figure them out in concrete examples like this one.
Again, HUGE thanks in advance and truly appreciate your help :)
Original Answer
Watermark: To my understanding, watermark in Flink and Spark Structured Stream is defined as (max-event-timestamp-seen-so-far - allowed-lateness). Any event whose event timestamp is less than or equal to this watermark will be discarded and ignored in result computations.
This is not correct and might be the source of the confusion. Out-of-Orderness and Lateness are different concepts in Flink. With the BoundedOutOfOrdernessTimestampExtractor the watermark is max-event-timestamp-seen-so-far - max-out-of-orderness. More about Allowed Lateness in the Flink Documentation [1].
If you run a Job with event time semantics, the processing time at the window operators is completely irrelevant:
events will be assigned to their windows based on their event time timestamp
time windows will be triggered once the watermark reaches their maximum timestamp (window end time - 1).
events with a timestamp older than current watermark - allowed lateness are discarded or sent to the late data side output [1]
This means that if you start a job at 12:00pm (processing time) and start ingesting data from the past, the watermark will also be (even further) in the past. So, the configured allowedLateness is irrelevant, because the data is not late with respect to event time.
On the other hand, if you first ingest some data from 12:00pm and afterwards data from 10:00am, the watermark will already have advanced to ~12:00pm before you ingest the old data. In this case the data from 10:00am will be "late". If it is later than the configured allowedLateness (default = 0), it is discarded (default) or sent to a side output (if configured) [1].
Follow Up Answers
The timeline for an event time window is the following:
1. The first element with a timestamp within the window arrives -> state for this window (& key) is created.
2. A watermark >= window_endtime - 1 arrives -> the window is fired (results are emitted), but the state is not discarded.
3. A watermark >= window_endtime + allowed_lateness arrives -> the state is discarded.
Between 2. and 3., events for this window are late, but within the allowed lateness. They are added to the existing state and, by default, the window is fired on each such record, emitting a refined result.
After 3., events for this window are discarded (or sent to the late-data side output).
So, yes, it makes sense to configure both. The out-of-orderness determines when the window is fired for the first time, while the allowed lateness determines how long the state is kept around to potentially update (refine) the results.
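In code, the two knobs sit in different places (Flink 1.8-era API, matching [1]; Event and its accessors are placeholder names):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

// given: DataStream<Event> events
SingleOutputStreamOperator<Event> summed = events
    // out-of-orderness: watermark = max event time seen so far - 2 minutes
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.minutes(2)) {
            @Override
            public long extractTimestamp(Event e) {
                return e.getTimestamp();
            }
        })
    .keyBy(e -> e.getKey())
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    // lateness: keep window state for 1 more minute after the first firing
    .allowedLateness(Time.minutes(1))
    .sideOutputLateData(lateTag)
    .sum("value");

DataStream<Event> lateEvents = summed.getSideOutput(lateTag);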
Regarding the boundaries: tumbling event time windows have a fixed length, are aligned across keys, and start at the Unix epoch. Empty windows don't exist. For your example this means:
- (1, 8:29) is added to window (8:25 - 8:29:59.999)
- (5, 8:26) is added to window (8:25 - 8:29:59.999)
- (3, 9:48) is added to window (9:45 - 9:49:59.999)
- Window (8:25 - 8:29:59.999) is fired, because the watermark has advanced to 9:48 - 0:02 = 9:46, which is larger than the last timestamp of the window. The window state is also discarded, because 9:46 is larger than the end time of the window plus the allowed lateness (1 minute).
- (7, 9:46) is added to window (9:45 - 9:49:59.999)
Hope this helps.
Konstantin
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/operators/windows.html#allowed-lateness

Is there an easy way to get the percentage of successful reads of the last x minutes?

I have a setup with a BeagleBone Black which communicates over I²C with its slaves every second and reads data from them. Sometimes the I²C readout fails, though, and I want to gather statistics about these failures.
I would like to implement an algorithm which displays the percentage of successful communications of the last 5 minutes (up to 24 hours) and updates that value constantly. If I implemented that 'naively' with an array storing the success/no-success of every second, that would mean a lot of wasted RAM/CPU load for a minor feature (especially if I wanted to see the statistics of the last 24 hours).
Does someone know a good way to do that, or can anyone point me in the right direction?
Why don't you just implement a low-pass filter? For every successful transfer, you push in a 1; for every failed one, a 0. The result is a number between 0 and 1. Assuming that your transfers happen periodically, this works well, and you just have to adjust the cutoff frequency of that filter to your desired "averaging duration".
However, I can't follow your RAM argument: assuming you store one byte representing success or failure per transfer, which you say happens every second, you end up with 86400 B per day; 85 KB/day is really negligible.
EDIT Cutoff frequency is something from signal theory and describes the highest or lowest frequency that passes a low or high pass filter.
Implementing a low-pass filter is trivial; something like (pseudocode):
new_val = 1     // init with no failed transfers
alpha = 0.001   // smoothing factor
while(true):
    old_val = new_val
    success = do_transfer_and_return_1_on_success_or_0_on_failure()
    new_val = alpha * success + (1-alpha) * old_val
That's a single-tap IIR (infinite impulse response) filter; single tap because there's only one alpha and thus only one number that is stored as state.
EDIT2: the value of alpha defines the behaviour of this filter.
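To give a rough feel (standard exponential-smoothing arithmetic, not from the original answer): the time constant of this filter is about 1/alpha samples, so with alpha = 0.001 and one transfer per second it averages over roughly 1000 s, i.e. about 17 minutes, and a single failure decays to about 37% of its initial influence after that time. For a 5-minute view, alpha around 1/300 would be a starting point.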
EDIT3: you can use a filter design tool to give you the right alpha; just set your low-pass filter's cutoff frequency to something like 0.5/integrationLengthInSamples, select an order of 0 for the IIR, and use an elliptic design method (most tools default to Butterworth, but 0-order Butterworths don't do a thing).
I'd use scipy and convert the resulting (b, a) tuple (a will be 1 here) to the correct form for this feedback form.
UPDATE: In light of the comment by the OP, 'determine a trend of which devices are failing', I would recommend the geometric average that Marcus Müller put forward.
ACCURATE METHOD
The method below is aimed at obtaining 'well-defined' statistics for performance over time that are also useful for 'after the fact' analysis.
Notice that the geometric average has a 'look back' over recent messages rather than a fixed time period.
Maintain a rolling array of 24*60/5 = 288 'prior success rates' (SR[i] with i = -1, -2, ..., -288), each representing a 5-minute interval in the preceding 24 hours.
That will consume about 2.3 KB if the elements are 64-bit doubles.
To 'effect' constant updating, use an estimated 'current' success rate as follows:
ECSR = (t*S/M + (300-t)*SR[-1]) / 300
where S and M are the counts of successes and messages in the current (partially complete) period, SR[-1] is the previous (now complete) bucket, and t is the number of seconds elapsed in the current bucket.
NB: When you start up there is no SR[-1] yet, so just use S/M for the first bucket.
In essence, the approximation assumes the error rate was steady over the preceding 5-10 minutes.
To 'effect' a 24-hour look-back, you can either 'shuffle' the data down (by copy or memcpy()) at the end of each 5-minute interval, or implement a 'circular array' by keeping track of the current bucket index (see the sketch below).
NB: For many management/diagnostic purposes, intervals of 15 minutes are often entirely adequate. You might want to make the 'grain' configurable.
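A compact Java sketch of the circular-array variant, under the assumptions above (one transfer per second, 5-minute buckets, 24-hour look-back); all names are mine:

import java.util.Arrays;

public class SuccessRateTracker {
    private static final int BUCKETS = 288;         // 24h / 5min
    private static final int BUCKET_SECONDS = 300;

    private final double[] rates = new double[BUCKETS]; // prior success rates SR[i]
    private int current = 0;        // index of the bucket being filled
    private long successes = 0, messages = 0;
    private int elapsedSeconds = 0; // t, seconds into the current bucket

    public SuccessRateTracker() {
        Arrays.fill(rates, 1.0);    // assume success where no data exists yet
    }

    // Call once per transfer attempt (i.e., once per second).
    public void record(boolean success) {
        messages++;
        if (success) successes++;
        if (++elapsedSeconds == BUCKET_SECONDS) {
            rates[current] = (double) successes / messages;
            current = (current + 1) % BUCKETS;  // circular: no shuffling/memcpy
            successes = messages = 0;
            elapsedSeconds = 0;
        }
    }

    // ECSR = (t*S/M + (300-t)*SR[-1]) / 300
    public double estimatedCurrentRate() {
        double partial = messages == 0 ? 1.0 : (double) successes / messages;
        int prev = (current + BUCKETS - 1) % BUCKETS;
        return (elapsedSeconds * partial
                + (BUCKET_SECONDS - elapsedSeconds) * rates[prev])
                / BUCKET_SECONDS;
    }
}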

Document classification with incomplete training set

Advice please. I have a collection of documents that all share a common attribute (e.g. the word "French" appears). Some of these documents have been marked as not pertinent to this collection (e.g. "French kiss" appears), but not all such documents are guaranteed to have been identified. What is the best method to figure out which other documents don't belong?
Assumptions
Given your example "French", I will work under the assumption that the feature is a word that appears in the document. Also, since you mention that "French kiss" is not relevant, I will further assume that in your case a feature is a word used in a particular sense. For example, if "pool" is a feature, you may say that documents mentioning swimming pools are relevant, but those talking about pool the sport (like snooker or billiards) are not relevant.
Note: Although word sense disambiguation (WSD) methods would work, they require too much effort and are overkill for this purpose.
Suggestion: localized language model + bootstrapping
Think of it this way: you don't have an incomplete training set, but a smaller training set. The idea is to use this small training data to build bigger training data. This is bootstrapping.
For each occurrence of your feature in the training data, build a language model based only on the words surrounding it. You don't need to build a model for the entire document; ideally, just the sentences containing the feature should suffice. This is what I am calling a localized language model (LLM).
Build two such LLMs from your training data (let's call it T_0): one for pertinent documents, say M1, and another for irrelevant documents, say M0. Now, to build bigger training data, classify documents based on M1 and M0: for every new document d, if d does not contain the feature-word, it is automatically added as a "bad" document; if d contains the feature-word, consider a local window around this word in d (the same window size that you used to build the LLMs) and compute the perplexity of this sequence of words with M0 and M1. Classify the document as belonging to the class which gives the lower perplexity.
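For reference (a standard definition, not specific to this answer): the perplexity of a word sequence w_1 ... w_n under a model M is PP_M(w_1 ... w_n) = P_M(w_1 ... w_n)^(-1/n), so a lower perplexity means the model explains the local context better, which is why the class whose LLM yields the lower perplexity wins.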
To formalize, the pseudo-code is:
T_0 := initial training set (consisting of relevant/irrelevant documents)
D_0 := additional data to be bootstrapped
N   := number of bootstrapping iterations

for i = 0 to N-1:
    T_i+1 := empty training set
    build M0 and M1 from T_i as discussed above, using window size w
    for d in D_0:
        if feature-word not in d:
            add d to the irrelevant documents of T_i+1
        else:
            compute perplexity scores P0 and P1 corresponding to M0 and M1,
            using window size w around the feature-word in d
            if P0 < P1 - delta:
                add d to the irrelevant documents of T_i+1
            else if P1 < P0 - delta:
                add d to the relevant documents of T_i+1
            else:
                do not use d in T_i+1
    select a small random sample from the relevant and irrelevant documents
    in T_i+1, and (re)classify them manually if required
T_N is your final training set. In the bootstrapping above, the parameter delta needs to be determined with experiments on some held-out data (also called development data).
The manual reclassification of a small sample is done so that the noise introduced during this bootstrapping does not accumulate through all N iterations.
First, you should take care with how you extract features from the sample docs. Counting every word is not a good way; you might need a technique like TF-IDF to teach the classifier which words are important for classification and which are not (a minimal sketch follows).
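A minimal TF-IDF sketch in plain Java (raw term frequency times log inverse document frequency; not tied to any particular library):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdf {
    // docs: each document as a list of tokens; returns one weight map per doc.
    public static List<Map<String, Double>> weigh(List<List<String>> docs) {
        // document frequency: number of docs containing each term
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();
        List<Map<String, Double>> weights = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> w = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                w.put(e.getKey(), e.getValue() * idf); // tf * idf
            }
            weights.add(w);
        }
        return weights;
    }
}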
Build the right dictionary. In your case, "French kiss" should be a single token, instead of a sequence of "French" + "kiss". Using the right technique to build the dictionary is important.
The remaining errors in the samples are normal; we call this "not linearly separable". There is a huge amount of advanced research on how to solve this problem. For example, an SVM (support vector machine) would be a good fit. Please note that a single-layer Rosenblatt perceptron usually shows very bad performance on datasets which are not linearly separable.
Some kinds of neural networks (like the Rosenblatt perceptron) can be trained on an erroneous data set and can show better performance than the training data suggests. Moreover, in many cases you should introduce some errors to avoid over-training.
You can label all unlabeled documents randomly, train several nets, and estimate their performance on the test set (of course, you should not include unlabeled documents in the test set). After that, you can iteratively recalculate the weights of the unlabeled documents as w_i = sum of quality(j) * w_ij, and then repeat the training and the weight recalculation, and so on. Because this procedure is equivalent to introducing a new hidden layer and recalculating its weights by a Hebbian procedure, the overall procedure should converge if your positive and negative sets are linearly separable in some network feature space.
