From the Flink documentation I see there are two different ways to define a window:
timeWindow(Time.seconds(5)) and also window(TumblingWindow/SlidingWindow), etc.
I am confused about the difference between them. In particular, is timeWindow a SlidingWindow or a TumblingWindow?
The JavaDoc for timeWindow(Time) explicitly says that it's a shortcut for .window(TumblingEventTimeWindows.of(size)) or .window(TumblingProcessingTimeWindows.of(size)), depending on the time characteristic of the stream. So yes, it's a TumblingWindow.
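To make the equivalence concrete, here is a minimal sketch against the DataStream Java API (the Tuple2 stream, the key/field positions, and the sum() aggregation are made up for illustration; timestamp assignment and watermarks are omitted, so an event-time job like this would not actually fire windows as written):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TimeWindowShortcut {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<Tuple2<String, Double>> readings =
                env.fromElements(Tuple2.of("a", 1.0), Tuple2.of("a", 2.0), Tuple2.of("b", 3.0));

        // Shortcut: resolves to tumbling event-time windows because of the time characteristic above.
        readings.keyBy(0).timeWindow(Time.seconds(5)).sum(1).print();

        // Explicit, equivalent form for an event-time stream.
        readings.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5))).sum(1).print();

        env.execute("timeWindow vs window");
    }
}

For sliding windows there is an analogous two-argument shortcut, timeWindow(size, slide), which maps to the Sliding*TimeWindows assigners in the same way.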
I'm trying to evaluate Apache Flink for the use case we're currently running in production using custom code.
So let's say there's a stream of events, each containing a specific attribute X which is a continuously increasing integer. That is, a bunch of contiguous events have this attribute set to N, then the next batch has it set to N+1, and so on.
I want to break the stream into windows of events with the same value of X and then do some computations on each separately.
So I define a GlobalWindow and a custom Trigger. In the Trigger's onElement method I check the attribute of each incoming element against the saved value of the current X (kept in a state variable); if they differ, I conclude that we've accumulated all the events with X = CURRENT, so it's time to do the computation and increase the X value in the state.
The problem with this approach is that the element from the next logical batch (with X = CURRENT + 1) has already been consumed, but it's not part of the previous batch.
Is there a way to put it back into the stream somehow, so that it is properly accounted for in the next batch?
Or maybe my approach is entirely wrong and there's an easier way to achieve what I need?
Thank you.
I think you are on the right track.
A Trigger specifies when a window can be processed and its results emitted.
The WindowAssigner is the part that decides which window an element is assigned to. So I would say you also need to provide a custom WindowAssigner implementation that assigns the same window to all elements with an equal value of X.
A more idiomatic way to do this with Flink would be to use stream.keyBy(X).window(...). The keyBy(X) takes care of grouping elements by their particular value for X. You then apply any sort of window you like. In your case a SessionWindow may be a good choice. It will fire for each key after that key hasn't been seen for some configurable period of time.
This approach will be much more robust with regard to unordered data which you must always assume in a stream processing system.
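To make that concrete, here is a minimal sketch (the Event POJO, the 5-second gap, and the sum() placeholder are assumptions for illustration, not from the original post; it also assumes a Flink release that ships session windows):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class PerXSessionWindows {

    // Minimal stand-in for the events described in the question.
    public static class Event {
        public long x;       // the continuously increasing attribute X
        public long payload;
        public Event() {}
        public Event(long x, long payload) { this.x = x; this.payload = payload; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Event> events = env.fromElements(
                new Event(1, 10), new Event(1, 20), new Event(2, 30), new Event(2, 40));

        events
            .keyBy("x")                                                    // group by the X attribute
            .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5))) // a key's window closes after 5s without new elements
            .sum("payload")                                                // placeholder for the real per-batch computation
            .print();

        env.execute("session window per X");
    }
}

Each distinct value of X gets its own window, and the window for X = N is evaluated once no element with that value has arrived for the configured gap, regardless of whether elements for X = N + 1 have already been consumed.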
I follow the example of
https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/libs/ml/multiple_linear_regression.html
but in the example the fit function only needs one parameter, while in my code fit requires three parameters:
mlr.fit(training, fitParameters, fitOperation);
I thought fitParameters might be an alternative to setIterations() and setStepsize(),
but what is fitOperation?
The fitOperation parameter is actually an implicit parameter which is filled in automatically by the Scala compiler. It encapsulates the MLR logic.
Since your fit function has 3 parameters, I suspect that you're using FlinkML with Flink's Java API. I would highly recommend using the Scala API, because otherwise you will have to construct the ML pipelines manually. If you still want to do it, take a look at the FitOperations defined in the MultipleLinearRegression companion object.
DoubleSerializer and DoubleValueSerializer both extend TypeSerializerSingleton and share the same methods, and the documentation for DoubleValue says that it is a boxed value of a Java double.
My question is: since we have DoubleValueSerializer, why do we still need DoubleSerializer? What is the design here?
Thanks in advance!
The DoubleSerializer and DoubleValueSerializer exist because the former serializes java Doubles and the latter serializes DoubleValue instances. These types are different.
The DoubleValue type is a wrapper around a Java double that implements the Key and Value interfaces. These interfaces date back to the time when Flink could not handle Java primitives directly; back then you always had to wrap them in a Value type. Nowadays there is no need to use them directly anymore, but they are still used internally by some components.
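As a small illustration of why the two types need separate serializers (just a sketch; the values are arbitrary): java.lang.Double is an immutable box, while DoubleValue is a mutable wrapper that can be reused between records, which is the kind of reuse the Value types were designed for.

import org.apache.flink.types.DoubleValue;

public class DoubleVsDoubleValue {
    public static void main(String[] args) {
        // java.lang.Double is immutable: assigning a new value allocates a new box.
        Double boxed = 1.0;
        boxed = 2.0;

        // DoubleValue is mutable and reusable, so record objects can be recycled
        // instead of creating garbage for every value that flows through an operator.
        DoubleValue reusable = new DoubleValue(1.0);
        reusable.setValue(2.0); // same object, new value
        System.out.println(boxed + " / " + reusable.getValue());
    }
}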
I am splitting some Java objects and then aggregating them. I am somewhat confused about how the completion strategy works in Camel (2.15.2). I am using completion size and completion timeout. If I understand correctly, completion timeout does not have much effect here, because there is not much waiting going on.
Altogether I have 3000+ objects, but it seems only part of them gets aggregated. If I vary the completion size, the situation changes: if the size is 100, it aggregates around 800, and if it is 200, it aggregates up to around 1600. But I don't know the number of objects in advance, so I cannot rely on an assumed number.
Can anyone please explain to me what I am doing wrong here?
If I use eagerCheckCompletion, it aggregates the whole thing in one go, which I don't want.
Below is my route:
from("direct:specializeddatavalidator")
.to("bean:headerFooterValidator").split(body())
.process(rFSStatusUpdater)
.process(dataValidator).choice()
.when(header("discrepencyList").isNotNull()).to("seda:errorlogger")
.otherwise().to("seda:liveupdater").end();
from("seda:liveupdater?concurrentConsumers=4&timeout=5000")
.aggregate(simple("${in.header.contentType}"),
batchAggregationStrategy())
.completionSize(MAX_RECORDS)
.completionTimeout(BATCH_TIME_OUT).to("bean:liveDataUpdater");
from("seda:errorlogger?concurrentConsumers=4")
.aggregate(simple("${in.header.contentType}"),
batchAggregationStrategy("discrepencyList"))
.completionSize(MAX_RECORDS_FOR_ERRORS)
.completionTimeout(BATCH_TIME_OUT)
.process(errorProcessor).to("bean:liveDataUpdater");
Weird, but if you want to aggregate all the split messages you can simply use
.split(body(), batchAggregationStrategy())
And depending on how you want it to work you can use
.shareUnitOfWork().stopOnException()
See http://camel.apache.org/splitter.html for more info
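For example, here is a sketch of the first route restructured this way (keeping the processors but leaving out the choice/error branch for brevity; once the Splitter finishes, the single aggregated exchange continues to liveDataUpdater, so no completion size or timeout has to be guessed):

from("direct:specializeddatavalidator")
    .to("bean:headerFooterValidator")
    .split(body(), batchAggregationStrategy())
        .shareUnitOfWork()   // let failures of sub-messages roll back to the parent exchange
        .stopOnException()   // stop splitting on the first failure
        .process(rFSStatusUpdater)
        .process(dataValidator)
    .end()                   // the Splitter emits the aggregated result here
    .to("bean:liveDataUpdater");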
I have seen references to 'zone' in the MsgPack C headers, but can find no documentation on what it is or what it's for. What is it? Furthermore, where's the function-by-function documentation for the C API?
msgpack_zone is an internal structure used for memory management & lifecycle at unpacking time. I would say you will never have to interact with it if you use the standard, high-level interface for unpacking or the alternative streaming version.
To my knowledge, there is no detailed documentation: instead you should refer to the test suite that provides convenient code samples to achieve the common tasks, e.g. see pack_unpack_c.cc and streaming_c.cc.
From what I could gather, it is a move-only type that stores the actual data of a msgpack::object. It may well be intended as an implementation detail, but it sometimes leaks into users' code. For example, any time you want to capture a msgpack::object in a lambda, you have to capture the corresponding msgpack::zone as well. Sometimes you can't use move capture (e.g. asio handlers in some cases only accept copyable handlers, or your compiler doesn't support the feature). To work around this, you can do something like the following:
msgpack::unpacked r;
while (pac_.next(&r)) {
    auto msg = r.get(); // msgpack::object whose data lives in r's zone
    // Hold the zone in a shared_ptr so the handler stays copyable
    // and the data outlives this loop iteration.
    auto z = std::shared_ptr<msgpack::zone>(r.zone().release());
    io_->post([this, msg, z]() {
        // msg is valid here as long as z is alive
    });
}