I'm trying to understand the IntervalJoin operation in Flink and have a question.
Let's assume we have three streams A, B, and C.
Here, we interval-join two pairs of streams: A-C and B-C.
In Java code, it looks like this:
// join stream A and stream C
SingleOutputStreamOperator<SensorReadingOutput> joined1 = A
        .intervalJoin(C)
        .between(Time.seconds(-1), Time.seconds(0))
        .process(new IntervalJoinFunction());

// join stream B and stream C
SingleOutputStreamOperator<SensorReadingOutput> joined2 = B
        .intervalJoin(C)
        .between(Time.seconds(-1), Time.seconds(0))
        .process(new IntervalJoinFunction());
As you can see, stream C is joined twice.
Can stream C be shared between the two joins with A and B?
That is, does stream C exist as a single instance, or is it duplicated (copied) for each of A and B?
I am confused because of two points in the IntervalJoin operation:
1. Every time we call .process at the end of an interval join, a new IntervalJoinOperator is created, so I think stream C would be copied.
2. In IntervalJoinOperator, records are cleaned up by an internal timer service that is triggered by event-time watermarks. Streams A and B would have different watermarks, which would affect stream C's retention period, so stream C should be copied and managed individually.
However, when I wrote test code to check whether records from the three streams with the same key are collected in the same task instance, they are.
Does anybody know the answer? Thank you!
For anyone wondering about the same question: the answer is that they don't share the stream.
Instead, a duplicate stream is created for the second IntervalJoin.
I've done some tests that print the buffer's address inside the IntervalJoinOperator.
For the A-C and B-C joins, the same record from C, joined with both A and B, shows a different address in each operator.
If the stream were shared, the address of a record from stream C would be the same in both.
I think there are two reasons for this:
1. Whenever .intervalJoin is called on a keyed stream, a new IntervalJoinOperator is created, and it contains its own buffer. A new buffer is created every time, so sharing the stream does not make sense.
2. The watermarks of streams A and B will differ, and the watermark decides the retention period of the buffers in the IntervalJoinOperator, so sharing the buffer does not make sense either.
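For reference, here is a minimal sketch of the kind of check described above. Since .process() requires a ProcessJoinFunction, the join function can log the identity hash of each right-side (stream C) record; SensorReading, SensorReadingOutput, and its constructor are placeholder names. If C were shared, the same record would report the same identity hash in both joins; in practice each operator holds its own deserialized copy.

import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.util.Collector;

public class IntervalJoinFunction
        extends ProcessJoinFunction<SensorReading, SensorReading, SensorReadingOutput> {

    @Override
    public void processElement(SensorReading left, SensorReading right,
                               Context ctx, Collector<SensorReadingOutput> out) {
        // System.identityHashCode approximates the object's "address" within this JVM;
        // a truly shared record would report the same value from both join operators.
        System.out.printf("task=%s right=%s identityHash=%x%n",
                getRuntimeContext().getTaskNameWithSubtasks(),
                right, System.identityHashCode(right));
        out.collect(new SensorReadingOutput(left, right)); // placeholder constructor
    }
}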
The javadoc for DataStream#assignAscendingTimestamps says:
* Assigns timestamps to the elements in the data stream and periodically creates
* watermarks to signal event time progress.
*
* This method is a shortcut for data streams where the element timestamp are known
* to be monotonously ascending within each parallel stream.
* In that case, the system can generate watermarks automatically and perfectly
* by tracking the ascending timestamps.
This method assumes that element timestamps are known to be monotonically ascending within each parallel stream. But in practice, almost no stream can guarantee that event timestamps arrive in ascending order.
I'm inclined to conclude that this method should never be used, but I'd like to ask whether I have missed something (e.g., a case where it should be used).
Generally I agree, it can rarely be used in practice. One exception is the following: if Kafka is used as a source with LogAppendTime, timestamps are in order per partition. You can then use per-partition watermarking in Flink [1] with the AscendingTimestampExtractor and will get pretty optimal watermarking.
Cheers,
Konstantin
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/kafka.html#kafka-consumers-and-timestamp-extractionwatermark-emission
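For illustration, a minimal sketch of that setup (the Event type, topic name, and deserialization schema are placeholders, and an existing StreamExecutionEnvironment env is assumed): assigning the extractor on the Kafka consumer itself, rather than on the resulting stream, is what enables per-partition watermarking.

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "example-group");

FlinkKafkaConsumer<Event> consumer =
        new FlinkKafkaConsumer<>("events", new EventDeserializationSchema(), props);

// Per-partition watermarking: each Kafka partition gets its own
// ascending-timestamp tracking, and watermarks are merged across partitions.
consumer.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Event>() {
    @Override
    public long extractAscendingTimestamp(Event element) {
        return element.getTimestamp(); // in order per partition with LogAppendTime
    }
});

DataStream<Event> stream = env.addSource(consumer);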
After reading the source code of DataStream#assignAscendingTimestamps, I see that it uses AscendingTimestampExtractor to extract the timestamps.
AscendingTimestampExtractor keeps the largest event timestamp seen so far. If an event arrives out of order, it logs a warning that the monotonically-ascending-timestamps assumption has been violated.
So I think this class may still be handy in practice for cases that cannot tolerate watermark lag (the watermark keeps growing with every element).
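As a small sketch of that behavior (Event and getTimestamp() are placeholders): the extractor emits the maximum timestamp seen so far (minus 1 ms) as the watermark, and its reaction to violations is configurable via withViolationHandler, with logging being the default.

AscendingTimestampExtractor<Event> extractor =
        new AscendingTimestampExtractor<Event>() {
            @Override
            public long extractAscendingTimestamp(Event element) {
                return element.getTimestamp();
            }
        }
        // default is LoggingHandler; FailingHandler throws instead of warning
        .withViolationHandler(new AscendingTimestampExtractor.FailingHandler());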
Background
We have two streams; let's call them A and B.
They produce elements a and b respectively.
Stream A produces elements at a slow rate (one per minute).
Stream B receives a single element once every two weeks. It uses a flatMap function which receives this element and generates ~2 million b elements in a loop:
(Java)
for (BElement value : valuesList) {
    out.collect(value); // emit each generated b element
}
The valuesList here contains ~2 million b elements.
We connect those streams (A and B) using connect, key by some key, and perform another flatMap on the connected stream:
streamA.connect(streamB).keyBy(AClass::someKey, BClass::someKey).flatMap(processConnectedStreams)
Each of the b elements has a different key, meaning there are ~2 million keys coming from the B stream.
The Problem
What we see is starvation: even though there are a elements ready to be processed, they are not processed by processConnectedStreams.
Our attempts to solve the issue
We tried to throttle stream B to 10 elements per second by calling Thread.sleep() every 10 elements:
long totalSent = 0;
for (BElement value : valuesList) {
    totalSent++;
    out.collect(value);
    if (totalSent % 10 == 0) {
        Thread.sleep(1000); // throttle: pause after every 10 elements
    }
}
The processConnectedStreams is simulated to take 1 second with another Thread.sleep(), and we have tried it with:
* Setting a parallelism of 10 for the whole pipeline - didn't work
* Setting a parallelism of 15 for the whole pipeline - did work
The Question
We don't want to use all these resources, since stream B is activated very rarely, and for stream A's elements a high parallelism is overkill.
Is it possible to solve it without setting the parallelism to more than the number of b elements we send every second?
It would be useful if you shared the complete workflow topology. For example, you don't mention doing any keying or random partitioning of the data. If that's really the case, then Flink is going to pipeline multiple operations in one task, which can (depending on the topology) lead to the problem you're seeing.
If that's the case, then forcing partitioning prior to the processConnectedStreams can help, as then that operation will be reading from network buffers.
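Here is a hedged sketch of that suggestion (ExpandBElements is a placeholder for the flatMap that generates the ~2 million b elements): forcing a shuffle, or simply disabling chaining, after the expansion step means the generated elements cross a buffered channel with backpressure instead of being pushed through the whole pipeline by a single task thread.

SingleOutputStreamOperator<BElement> expanded = streamB
        .flatMap(new ExpandBElements()) // emits ~2 million b elements per input
        .disableChaining();             // or .rebalance() to force a network shuffle

streamA
        .connect(expanded)
        .keyBy(AClass::someKey, BClass::someKey)
        .flatMap(processConnectedStreams);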
I'd like to batch process two files with Apache Flink, one after the other.
For a concrete example: suppose I want to assign an index to each line, such that lines from the second file follow those from the first. Instead of doing so, the following code interleaves lines from the two files:
val env = ExecutionEnvironment.getExecutionEnvironment
val text1 = env.readTextFile("/path/to/file1")
val text2 = env.readTextFile("/path/to/file2")
val union = text1.union(text2).flatMap { ... }
I want to make sure all of text1 is sent through the flatMap operator first, and then all of text2. What is the recommended way to do so?
Thanks in advance for the help.
DataSet.union() does not provide any order guarantees across inputs. Records from the same input partition will remain in order but will be merged with records from the other input.
But there is a more fundamental problem. Flink is a parallel data processor. When processing data in parallel, a global order cannot be preserved. For example, when Flink reads files in parallel, it tries to split these files and process each split independently. The splits are handed out without any particular order. Hence, the records of a single file are already shuffled. You would need to set the parallelism of the whole job to 1 and implement a custom InputFormat to make this work.
You can make that work, but it won't run in parallel and you'll need to tweak many things. I don't think that Flink is the best tool for such a task.
Have you considered using simple unix commandline tools to concatenate your files?
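For completeness, a rough sketch of the non-parallel custom-InputFormat approach mentioned above (in Java; names are illustrative). The NonParallelInput marker keeps the source at a single split, so lines are emitted in file order, with file1 strictly before file2:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.flink.api.common.io.GenericInputFormat;
import org.apache.flink.api.common.io.NonParallelInput;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.io.GenericInputSplit;

public class SequentialFilesInputFormat
        extends GenericInputFormat<Tuple2<Long, String>> implements NonParallelInput {

    private final String[] paths = {"/path/to/file1", "/path/to/file2"};
    private transient BufferedReader reader;
    private int fileIdx = 0;
    private long index = 0;
    private String nextLine;

    @Override
    public void open(GenericInputSplit split) throws IOException {
        super.open(split);
        reader = new BufferedReader(new FileReader(paths[fileIdx]));
        advance();
    }

    // Read the next line, moving on to the next file when the current one ends.
    private void advance() throws IOException {
        nextLine = reader.readLine();
        while (nextLine == null && fileIdx + 1 < paths.length) {
            reader.close();
            fileIdx++;
            reader = new BufferedReader(new FileReader(paths[fileIdx]));
            nextLine = reader.readLine();
        }
    }

    @Override
    public boolean reachedEnd() {
        return nextLine == null;
    }

    @Override
    public Tuple2<Long, String> nextRecord(Tuple2<Long, String> reuse) throws IOException {
        Tuple2<Long, String> record = Tuple2.of(index++, nextLine);
        advance();
        return record;
    }

    @Override
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        }
    }
}

// usage: env.createInput(new SequentialFilesInputFormat()) yields (index, line)
// pairs, with all of file1's lines indexed before file2's.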
Does Flink handle out-of-order tuples even when one does not use a windowing operator?
For example:
withTimestampsAndWatermarks
    .keyBy(...)
    .map(...) // some stateful function
    .addSink(...);
Will map wait to process elements until it receives the right watermark, or will it process elements without waiting?
The problem is that the partitioned state held by map could be affected by out-of-order processing of tuples.
Thank you in advance
The short answer is no. The map operator doesn't work with watermarks at all.
You will get elements in the same order as they appear in the input stream.
For further reference, check the implementation of the StreamMap operator (see the GitHub source code), where you can see that watermark elements are simply forwarded to the output.
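If you do need the stateful logic to see elements only after the watermark has passed their timestamps, one common pattern (a hedged sketch; Event is a placeholder POJO, and timestamps are assumed to be assigned upstream) is to replace the map with a KeyedProcessFunction that buffers elements in keyed state and releases them from an event-time timer:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class WatermarkAlignedMap extends KeyedProcessFunction<String, Event, Event> {

    // timestamp -> elements buffered until the watermark passes that timestamp
    private transient MapState<Long, List<Event>> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getMapState(new MapStateDescriptor<>(
                "buffer", Types.LONG, Types.LIST(Types.POJO(Event.class))));
    }

    @Override
    public void processElement(Event value, Context ctx, Collector<Event> out) throws Exception {
        long ts = ctx.timestamp(); // assumes timestamps are assigned
        List<Event> list = buffer.get(ts);
        if (list == null) {
            list = new ArrayList<>();
        }
        list.add(value);
        buffer.put(ts, list);
        // fires once the watermark passes this element's timestamp
        ctx.timerService().registerEventTimeTimer(ts);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
        List<Event> list = buffer.get(timestamp);
        if (list != null) {
            for (Event e : list) {
                out.collect(e); // safe to apply the stateful logic here
            }
            buffer.remove(timestamp);
        }
    }
}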