Summing a number from a random number source - apache-flink

I'm just starting to learn Flink and trying to build a very basic toy example which sums integers over time and periodically prints the total sum so far.
I've created a random number generator source class like this:
// RandomNumberSource.java
public class RandomNumberSource implements SourceFunction<Integer> {
    public volatile boolean isRunning = true;
    private Random rand;

    public RandomNumberSource() {
        this.rand = new Random();
    }

    @Override
    public void run(SourceContext<Integer> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(rand.nextInt(200));
            Thread.sleep(1000L);
        }
    }

    @Override
    public void cancel() {
        this.isRunning = false;
    }
}
As you can see, it generates a random number every second.
Now how would I go about summing the numbers as they are generated?
// StreamingJob.java
public class StreamingJob {
    public static void main(String[] args) throws Exception {
        // set up the streaming execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Integer> randomNumber = env.addSource(new RandomNumberSource());

        // pseudo code:
        // randomNumber
        //   .window(Time.seconds(5))
        //   .reduce(0, (acc, i) => acc + i) // (initial value, reducer)
        //   .sum()

        // execute program
        env.execute("Flink Streaming Random Number Sum Aggregation");
    }
}
I've added pseudo code to explain what I'm trying to do, i.e. every 5 seconds, sum all the numbers generated in that window and print the total.
I feel like I'm missing something in my approach and might need some guidance on how to do this.

The window operator is only available on keyed streams. For a non-keyed stream like this one, you should use windowAll instead. Here's the snippet:
randomNumber
    .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .sum(0)
    .print()
    .setParallelism(1);
Also check this for reference on various window considerations.
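Putting that together with the question's StreamingJob, a minimal end-to-end sketch could look like the following (imports shown for clarity; .sum(0) works on the plain Integer stream because for atomic types position 0 refers to the value itself):
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class StreamingJob {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Integer> randomNumber = env.addSource(new RandomNumberSource());

        randomNumber
            // collect the whole (non-keyed) stream into 5-second processing-time windows
            .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
            // sum the Integer values within each window
            .sum(0)
            // print each window's total; parallelism 1 keeps the output in one task
            .print()
            .setParallelism(1);

        env.execute("Flink Streaming Random Number Sum Aggregation");
    }
}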

Related

Busy time is too high for simple process function

Finally, after a month of research I found the main reason.
The main cause was IP2Location. I am using the IP2Location Java library to look up IP address locations in BIN files. At peak times this causes a problem. I can at least avoid the problem by passing the IP2Proxy.IOModes.IP2PROXY_MEMORY_MAPPED parameter before reading the BIN files.
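A rough sketch of what that looks like (assuming the IP2Proxy Java component's Open(String, IOModes) call; the path is illustrative and the exact API may differ between versions):
IP2Proxy proxy = new IP2Proxy();
// memory-mapped mode avoids hitting the BIN file on disk for every lookup
proxy.Open("/path/to/IP2PROXY.BIN", IP2Proxy.IOModes.IP2PROXY_MEMORY_MAPPED);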
I also found that a few state objects don't conform to the POJO standard, which causes high load.
I am using Flink v1.13. There are 4 task managers (16 CPUs each) with 3800 tasks (default application parallelism is 28).
In my application one operator always has a high busy time (around 80%-90%).
If I restart the Flink application, the busy time decreases, but after 5-10 hours of running it increases again.
In Grafana I can see that the busy time for ProcessStream increases. Here is the Prometheus query: avg((avg_over_time(flink_taskmanager_job_task_busyTimeMsPerSecond[1m]))) by (task_name)
There is no backpressure in the ProcessStream task. To measure backpressure time I am using: flink_taskmanager_job_task_backPressuredTimeMsPerSecond
But I couldn't find any reason for the high busy time.
Here is the code:
private void processOne(DataStream<KafkaObject> kafkaLog) {
    kafkaLog
        .filter(new FilterRequest())
        .name(FilterRequest.class.getSimpleName())
        .map(new MapToUserIdAndTimeStampMs())
        .name(MapToUserIdAndTimeStampMs.class.getSimpleName())
        .keyBy(UserObject::getUserId) // getUserId returns an int
        .process(new ProcessStream())
        .name(ProcessStream.class.getSimpleName())
        .addSink(...);
}
// ...
// ...
public class ProcessStream extends KeyedProcessFunction<Integer, UserObject, Output> {

    private static final long STATE_TIMER = 5 * 60 * 1000L; // 5 min in milliseconds
    private static final int AVERAGE_REQUEST = 74;
    private static final int STANDARD_DEVIATION = 32;
    private static final int MINIMUM_REQUEST = 50;
    private static final int THRESHOLD = 70;

    private transient ValueState<Tuple2<Integer, Integer>> state;

    @Override
    public void open(Configuration parameters) throws Exception {
        ValueStateDescriptor<Tuple2<Integer, Integer>> stateDescriptor = new ValueStateDescriptor<>(
                ProcessStream.class.getSimpleName(),
                TypeInformation.of(new TypeHint<Tuple2<Integer, Integer>>() {}));
        state = getRuntimeContext().getState(stateDescriptor);
    }

    @Override
    public void processElement(UserObject value, KeyedProcessFunction<Integer, UserObject, Output>.Context ctx, Collector<Output> out) throws Exception {
        Tuple2<Integer, Integer> stateValue = state.value();
        if (Objects.isNull(stateValue)) {
            stateValue = Tuple2.of(1, 0);
            ctx.timerService().registerProcessingTimeTimer(value.getTimestampMs() + STATE_TIMER);
        }

        int totalRequest = stateValue.f0;
        int currentScore = stateValue.f1;

        if (totalRequest >= MINIMUM_REQUEST && currentScore >= THRESHOLD) {
            out.collect({convert_to_output});
            state.clear();
        } else {
            stateValue.f0 = totalRequest + 1;
            stateValue.f1 = calculateNextScore(stateValue.f0);
            state.update(stateValue);
        }
    }

    private int calculateNextScore(int totalRequest) {
        return (totalRequest - AVERAGE_REQUEST) / STANDARD_DEVIATION;
    }

    @Override
    public void onTimer(long timestamp, KeyedProcessFunction<Integer, UserObject, Output>.OnTimerContext ctx, Collector<Output> out) throws Exception {
        state.clear();
    }
}
Since you're registering the timer with a timestamp taken from the incoming record (value.getTimestampMs() + STATE_TIMER), you want to be running with event time, and setting watermarks based on that incoming record's timestamp. Otherwise you have no idea when the timer actually fires, because the record's timestamp might be completely different from the current processing time.
This also means you want to use .registerEventTimeTimer() instead of .registerProcessingTimeTimer().
Without these changes you might be filling up the TaskManager heap with state that is never cleared, which can lead to high CPU load.
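A minimal sketch of that change (the bounded-out-of-orderness duration is illustrative; KafkaObject and getTimestampMs() come from the code above):
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// assign event-time timestamps and watermarks from each record's own timestamp
DataStream<KafkaObject> withTimestamps = kafkaLog.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<KafkaObject>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, previousTimestamp) -> event.getTimestampMs()));

// ... and inside processElement, register an event-time timer instead of a processing-time one:
ctx.timerService().registerEventTimeTimer(value.getTimestampMs() + STATE_TIMER);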

Flink DataStream sort program does not output

I have written a small test case code in Flink to sort a datastream. The code is as follows:
public enum StreamSortTest {
    ;

    public static class MyProcessWindowFunction extends ProcessWindowFunction<Long, Long, Integer, TimeWindow> {

        @Override
        public void process(Integer key, Context ctx, Iterable<Long> input, Collector<Long> out) {
            List<Long> sortedList = new ArrayList<>();
            for (Long i : input) {
                sortedList.add(i);
            }
            Collections.sort(sortedList);
            sortedList.forEach(l -> out.collect(l));
        }
    }

    public static void main(final String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);
        env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);

        DataStream<Long> probeSource = env.fromSequence(1, 500).setParallelism(2);

        // range partition the stream into two parts based on data value
        DataStream<Long> sortOutput =
                probeSource
                        .keyBy(x -> {
                            if (x < 250) {
                                return 1;
                            } else {
                                return 2;
                            }
                        })
                        .window(TumblingProcessingTimeWindows.of(Time.seconds(20)))
                        .process(new MyProcessWindowFunction());

        sortOutput.print();
        System.out.println(env.getExecutionPlan());

        env.executeAsync();
    }
}
However, the code just outputs the execution plan and a few other lines. But it doesn't output the actual sorted numbers. What am I doing wrong?
The main problem I can see is that you are using a processing-time window with a very short input, which will certainly be processed in far less than 20 seconds. Flink is able to detect the end of input (for a stream from a file, or from a sequence as in your case) and generate a Long.MAX_VALUE watermark, which closes all open event-time windows and fires all event-time timers. It does not do the same for processing-time computations, so in your case you need to make sure the job actually runs long enough for the window to close, or else use a custom trigger or a different time characteristic.
One other thing I am not sure about, since I never used it much, is whether you should use executeAsync for local execution; according to the docs it is meant for situations where you don't want to wait for the result of the job.
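One way to make this particular example terminate with output (a sketch, not the only option) is to switch to event-time windows with synthetic timestamps derived from the sequence values, so the end-of-input watermark closes the windows, and to block on env.execute() instead of executeAsync():
// assign each element its own value as a (purely synthetic) event-time timestamp
DataStream<Long> probeSource = env.fromSequence(1, 500)
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Long>forMonotonousTimestamps()
                        .withTimestampAssigner((value, ts) -> value));

probeSource
        .keyBy(x -> x < 250 ? 1 : 2)
        // event-time windows are closed by the Long.MAX_VALUE watermark at end of input
        .window(TumblingEventTimeWindows.of(Time.milliseconds(250)))
        .process(new MyProcessWindowFunction())
        .print();

env.execute(); // block until the bounded job finishes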

Flink streaming example that generates its own data

Earlier I asked about a simple hello world example for Flink. This gave me some good examples!
However I would like to ask for a more ‘streaming’ example where we generate an input value every second. This would ideally be random, but even just the same value each time would be fine.
The objective is to get a stream that ‘moves’ with no/minimal external touch.
Hence my question:
How to show Flink actually streaming data without external dependencies?
I found how to do this by generating data externally and writing to Kafka, or by listening to a public source; however, I am trying to solve it with minimal dependencies (similar to starting with GenerateFlowFile in NiFi).
Here's an example. This was constructed as an example of how to make your sources and sinks pluggable. The idea being that in development you might use a random source and print the results, for tests you might use a hardwired list of input events and collect the results in a list, and in production you'd use the real sources and sinks.
Here's the job:
/*
 * Example showing how to make sources and sinks pluggable in your application code so
 * you can inject special test sources and test sinks in your tests.
 */
public class TestableStreamingJob {
    private SourceFunction<Long> source;
    private SinkFunction<Long> sink;

    public TestableStreamingJob(SourceFunction<Long> source, SinkFunction<Long> sink) {
        this.source = source;
        this.sink = sink;
    }

    public void execute() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> longStream =
                env.addSource(source)
                        .returns(TypeInformation.of(Long.class));

        longStream
                .map(new IncrementMapFunction())
                .addSink(sink);

        env.execute();
    }

    public static void main(String[] args) throws Exception {
        TestableStreamingJob job = new TestableStreamingJob(new RandomLongSource(), new PrintSinkFunction<>());
        job.execute();
    }

    // While it's tempting for something this simple, avoid using anonymous classes or lambdas
    // for any business logic you might want to unit test.
    public static class IncrementMapFunction implements MapFunction<Long, Long> {

        @Override
        public Long map(Long record) throws Exception {
            return record + 1;
        }
    }
}
Here's the RandomLongSource:
public class RandomLongSource extends RichParallelSourceFunction<Long> {

    private volatile boolean cancelled = false;
    private Random random;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        random = new Random();
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (!cancelled) {
            Long nextLong = random.nextLong();
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(nextLong);
            }
        }
    }

    @Override
    public void cancel() {
        cancelled = true;
    }
}
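As written, this source emits values as fast as it can. Since the question asks for one value per second, a small (illustrative) tweak is to sleep inside the run loop:
@Override
public void run(SourceContext<Long> ctx) throws Exception {
    while (!cancelled) {
        synchronized (ctx.getCheckpointLock()) {
            ctx.collect(random.nextLong());
        }
        Thread.sleep(1000L); // emit roughly one value per second
    }
}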

How to split and merge data (vectors) in Apache's Flink, without using windows

I need to split a cube of integers into vectors, perform some operation on each vector (a simple addition say), and then merge the vectors back into a cube. The vector operations should be performed in parallel (i.e. a vector per stream). The cubes are objects that contain an ID.
I can split the cube into vectors, create a tuple using the cube ID, and then use keyBy(id) to create a partition per cube's vectors. However, it seems like I have to use a window of some time unit to do this. The application is very latency sensitive, so I would prefer to combine the vectors as they arrive, perhaps using some kind of logical clock (I know how many vectors are in a cube), and when the last vector arrives send the reassembled cube downstream. Is this possible in Flink?
Here's a code snippet exemplifying this idea:
// Stream topology..
final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Cube> stream = env
        // Take cubes from collection and send downstream
        .fromCollection(cubes)
        // Split the cube (int[][][]) into vectors (int[]) and send downstream
        .flatMap(new VSplitter()) // returns tuple with id at pos 1
        .keyBy(1)
        // For each value in each vector element, add its value with one.
        .map(new MapFunction<Tuple2<CubeVector, Integer>, Tuple2<CubeVector, Integer>>() {
            @Override
            public Tuple2<CubeVector, Integer> map(Tuple2<CubeVector, Integer> cVec) throws Exception {
                CubeVector cv = cVec.getField(0);
                cv.cubeVectorAdd(1);
                cVec.setField(cv, 0);
                return cVec;
            }
        })
        //** Merge vectors back to a cube **//
        .
        .
        .

// The cube splitter to vectors..
public static class VSplitter implements FlatMapFunction<Cube, Tuple2<CubeVector, Integer>> {
    @Override
    public void flatMap(Cube cube, Collector<Tuple2<CubeVector, Integer>> out) throws Exception {
        for (CubeVector cv : cubeVSplit(cube)) {
            //out.assignTimestamp()
            out.collect(new Tuple2<CubeVector, Integer>(cv, cube.getId()));
        }
    }
}
You could use a keyed FlatMapFunction which keeps collecting CubeVectors in state until it has seen enough of them to reconstruct a Cube. The following code snippet should do the trick:
DataStream<Tuple2<CubeVector, Integer>> input = ...

input.keyBy(1).flatMap(
    new RichFlatMapFunction<Tuple2<CubeVector, Integer>, Cube>() {

        private final ListStateDescriptor<CubeVector> cubeVectorsStateDescriptor = new ListStateDescriptor<CubeVector>(
                "cubeVectors",
                new CubeVectorTypeInformation());

        private final ValueStateDescriptor<Integer> cubeVectorCounterDescriptor = new ValueStateDescriptor<>(
                "cubeVectorCounter",
                BasicTypeInfo.INT_TYPE_INFO);

        private ListState<CubeVector> cubeVectors;
        private ValueState<Integer> cubeVectorCounter;

        @Override
        public void open(Configuration parameters) {
            cubeVectors = getRuntimeContext().getListState(cubeVectorsStateDescriptor);
            cubeVectorCounter = getRuntimeContext().getState(cubeVectorCounterDescriptor);
        }

        @Override
        public void flatMap(Tuple2<CubeVector, Integer> cubeVectorIntegerTuple2, Collector<Cube> collector) throws Exception {
            cubeVectors.add(cubeVectorIntegerTuple2.f0);

            // the counter state is null for the first element of each key
            final Integer oldCounterValue = cubeVectorCounter.value();
            final int newCounterValue = (oldCounterValue == null ? 0 : oldCounterValue) + 1;

            if (newCounterValue == NUMBER_CUBE_VECTORS) {
                Cube cube = createCube(cubeVectors.get());
                cubeVectors.clear();
                cubeVectorCounter.update(0);

                collector.collect(cube);
            } else {
                cubeVectorCounter.update(newCounterValue);
            }
        }
    });
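Here NUMBER_CUBE_VECTORS stands for the known number of vectors per cube (the question states this count is known up front), and createCube is whatever logic reassembles a Cube from its vectors. Because the stream is keyed by the cube id, each cube gets its own list and counter state, so the reassembled cube is emitted as soon as its last vector arrives, with no windowing involved.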

Flink: ValueState on RichFlatMapFunction always returns null

I'm trying to calculate the highest count of hashtags found in a given tumbling window.
For this I do a kind of "word count" for hashtags and sum them up. This works fine. After that, I try to find the hashtag with the highest count in the given window. I use a RichFlatMapFunction for this and a ValueState to store the current maximum count of a single hashtag, but this doesn't work.
I have debugged my code and found that the value of the ValueState "maxVal" is null in every flatMap step, so the update() and value() methods don't work in my scenario.
Have I misunderstood the RichFlatMapFunction or its usage?
Here is my code; everything except the last flatMap function works as expected:
public class TwitterHashtagCount {
    public static void main(String args[]) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);

        DataStream<String> tweetsRaw = env.addSource(new TwitterSource(TwitterConnection.getTwitterConnectionProperties()));
        DataStream<String> tweetsGerman = tweetsRaw.filter(new EnglishLangFilter());

        DataStream<Tuple2<String, Integer>> tweetHashtagCount = tweetsGerman
                .flatMap(new TwitterHashtagFlatMap())
                .keyBy(0)
                .timeWindow(Time.seconds(15))
                .sum(1)
                .flatMap(new RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                    private transient ValueState<Integer> maxVal;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        ValueStateDescriptor<Integer> descriptor =
                                new ValueStateDescriptor<>(
                                        // state name
                                        "max-val",
                                        // type information of state
                                        TypeInformation.of(Integer.class));
                        maxVal = getRuntimeContext().getState(descriptor);
                    }

                    @Override
                    public void flatMap(Tuple2<String, Integer> value, Collector<Tuple2<String, Integer>> out) throws Exception {
                        Integer maxCount = maxVal.value();

                        if (maxCount == null) {
                            maxCount = 0;
                            maxVal.update(0);
                        }

                        if (value.f1 > maxCount) {
                            maxVal.update(maxCount);
                            out.collect(new Tuple2<String, Integer>(value.f0, value.f1));
                        }
                    }
                });

        tweetHashtagCount.print();

        env.execute("Twitter Streaming WordCount");
    }
}
I'm wondering why the code you've shared runs at all. The result of sum(1) is a non-keyed stream, while the managed state interface you are using expects a keyed stream and keeps a separate instance of the state for each key. I'm surprised you're not getting an error saying "Keyed state can only be used on a 'keyed stream', i.e., after a 'keyBy()' operation."
Since you've previously windowed the stream, if you do key it again (with the same key) before the RichFlatMapFunction, each key will occur only once per window, so maxVal will always be null.
Something like this might do what you intend, if your goal is to find the max across all hashtags in each time window:
tweetsGerman
    .flatMap(new TwitterHashtagFlatMap())
    .keyBy(0)
    .timeWindow(Time.seconds(15))
    .sum(1)
    .timeWindowAll(Time.seconds(15))
    .max(1)
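Note that timeWindowAll, like any non-keyed window, runs as a non-parallel operator with parallelism 1, so the final max(1) over each 15-second window is computed in a single task.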
