Apache Flink makes it easy to customize operators. I am interested in the internal processing and want to log things that happen inside an operator. For this, a logger object is handed to the operator.
public class LogSink extends RichSinkFunction<TaxiRide> {

    private static final Logger log = LoggerFactory.getLogger("myLogger");

    public LogSink() {
        String msg = "Log Sink initialized";
        log.info(msg);
    }

    @Override
    public void invoke(TaxiRide ride, Context context) throws Exception {
        log.info("Name: " + ride.getName());
    }
}
In my main method on the main server (master), I initialize the operator, so the message "Log Sink initialized" appears in my custom log file, as desired.
But the log messages (e.g. "Name: TaxiRide324") that are logged within invoke(), which is called by a worker in another JVM, are written to Flink's taskexecutor.log.
I assume this is because of the distributed processing: the TaskManagers and the JobManager run in different JVMs, so the logger initialized on the master is not the one used during execution. (Interestingly, there is no NullPointerException...)
So my question is: how can I share objects between the initialization and the execution of an inner class on a distributed Flink cluster?
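To make the behavior visible, here is a minimal sketch (the open() override and the subtask logging are additions for illustration): because the logger field is static, every JVM that loads the class, the client as well as each TaskManager, initializes its own logger on class load, which is why invoke() sees a non-null logger but writes via that JVM's logging configuration:

public class LogSink extends RichSinkFunction<TaxiRide> {

    // static: initialized on class load, once per JVM, so it is never null,
    // but each JVM resolves its own logging configuration for it
    private static final Logger log = LoggerFactory.getLogger("myLogger");

    @Override
    public void open(Configuration parameters) {
        // runs on a TaskManager, once per parallel instance; logging the subtask
        // index shows which JVM's log the messages end up in
        log.info("Log sink opened in subtask {}", getRuntimeContext().getIndexOfThisSubtask());
    }

    @Override
    public void invoke(TaxiRide ride, Context context) {
        // written according to the TaskManager JVM's logging setup
        // (taskexecutor.log by default), not the client's custom log file
        log.info("Name: {}", ride.getName());
    }
}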
I have a custom source which generates some events every x minutes. I have referred to this file, and my code looks something like below:
public class PeriodicSourceGenerator extends RichParallelSourceFunction<GenericMetric> {

    private static final Logger logger = LoggerFactory.getLogger(PeriodicSourceGenerator.class);

    private transient AtomicBoolean isRunning;

    @Override
    public void open(final Configuration c) throws Exception {
        isRunning = new AtomicBoolean(true);
    }

    @Override
    public void run(SourceContext<GenericMetric> ctx) throws Exception {
        while (isRunning.get()) {
            //noinspection BusyWait
            Thread.sleep(300000); // 5 mins
            final long ts = System.currentTimeMillis();
            final MetricStore.MetricPoint mp = new MetricStore.MetricPoint(ts, 1, -1);
            synchronized (ctx.getCheckpointLock()) {
                // mk is the metric key, defined elsewhere in the actual code
                ctx.collectWithTimestamp(new GenericMetric(mk, MetricName.vRNI_internal_droppedTx_flow_absolute_latest_number, mp), ts);
                ctx.collectWithTimestamp(new GenericMetric(mk, MetricName.vRNI_internal_droppedRx_flow_absolute_latest_number, mp), ts);
            }
        }
        logger.info("Job cancelled. Shutting Down Periodic Source Generator");
    }

    @Override
    public void cancel() {
        isRunning.set(false);
    }
}
I am running multiple pipelines in a single Flink job, which looks something like below.
I am running Flink with the default operator chaining and slot sharing enabled. All my operators have the same parallelism, 30, and I have 5 task managers, so each task manager has 6 slots.
Can someone let me know how the sleep in the PeriodicSourceGenerator pipeline will affect the Collection Source pipeline? My understanding is that the sleep will cause the PeriodicSourceGenerator pipeline to be context-switched out in favor of the Collection Source pipeline, and the entire slot will not be paused for 5 minutes. Is my understanding correct?
Flink Version - 1.13.2
Sleeping in one operator won't pause the entire slot, only the task containing that operator. In this case, sleeping in PeriodicSourceGenerator will not affect the Collection Source pipeline, since these pipelines aren't connected.
In general, you should avoid sleeping (or blocking) in the main task thread. This has negative consequences, such as blocking checkpointing for the entire job. In this specific case, it's okay to sleep the way you're doing it: i.e., outside of the checkpoint lock.
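If prompt cancellation matters, one common pattern (a sketch reusing the question's class names; the monitor object is an addition) is to replace Thread.sleep() with a timed wait() that cancel() can wake up:

public class PeriodicSourceGenerator extends RichParallelSourceFunction<GenericMetric> {

    // transient: java.lang.Object is not serializable, so create it in open()
    private transient Object waitLock;
    private volatile boolean isRunning = true;

    @Override
    public void open(final Configuration c) {
        waitLock = new Object();
    }

    @Override
    public void run(SourceContext<GenericMetric> ctx) throws Exception {
        while (isRunning) {
            synchronized (waitLock) {
                waitLock.wait(300_000L); // returns early when cancel() notifies
            }
            if (!isRunning) {
                break;
            }
            // ... build the MetricPoint and emit under the checkpoint lock, as before ...
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
        final Object lock = waitLock;
        if (lock != null) { // cancel() can race with open()
            synchronized (lock) {
                lock.notifyAll(); // wake run() immediately instead of after 5 mins
            }
        }
    }
}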
I'm just learning Apache Flink and here is the Word Count sample:
https://ci.apache.org/projects/flink/flink-docs-stable/getting-started/tutorials/local_setup.html
It works, but there is something I can't understand clearly.
Flink has three parts: JobManager, TaskManager, and JobClient. In my understanding, the Java code of the class SocketWindowWordCount should be part of the JobClient; this class sends what it wants to do to the JobClient, and the JobClient then sends the tasks to the JobManager.
Am I right?
If I'm right, I don't know which part of the code in SocketWindowWordCount.java is responsible for sending the work to the JobClient.
Is listening on the port also part of the task that will be sent to the JobManager and then to the TaskManager?
// get input data by connecting to the socket
DataStream<String> text = env.socketTextStream("localhost", port, "\n");

// parse the data, group it, window it, and aggregate the counts
DataStream<WordWithCount> windowCounts = text
    .flatMap(new FlatMapFunction<String, WordWithCount>() {
        @Override
        public void flatMap(String value, Collector<WordWithCount> out) {
            for (String word : value.split("\\s")) {
                out.collect(new WordWithCount(word, 1L));
            }
        }
    })
    .keyBy("word")
    .timeWindow(Time.seconds(5), Time.seconds(1))
    .reduce(new ReduceFunction<WordWithCount>() {
        @Override
        public WordWithCount reduce(WordWithCount a, WordWithCount b) {
            return new WordWithCount(a.word, a.count + b.count);
        }
    });

// print the results with a single thread, rather than in parallel
windowCounts.print().setParallelism(1);
Is all of the code above part of the task?
In short, I roughly understand the architecture of Flink, but I want to know more details about how the JobClient works.
Your program itself is the JobClient, from an architectural point of view. In particular, you have dependencies on the JobClient that are used when you execute the DataStream program.
All of your code is the task definition that gets serialized and sent to the JobManager, which distributes it to the TaskManager.
You left out the "most" important part of the program:
env.execute("Socket Window WordCount");
That is what actually triggers the JobClient to package the DataStream program and send it to the configured JobManager.
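To make the split concrete, here is a minimal sketch (port 9000 is an assumption): everything before execute() only records operations into a dataflow graph in the client JVM, and nothing runs on the cluster until execute() submits that graph. This also answers the question about the port: the socket is opened later, by a source task running on a TaskManager, not by the client.

public static void main(String[] args) throws Exception {
    // runs in the client JVM: builds a graph, executes nothing yet
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // also just graph construction; the socket is NOT opened here, but later
    // inside a source task on a TaskManager
    DataStream<String> text = env.socketTextStream("localhost", 9000, "\n");
    text.print();

    // only now does the client serialize the graph (including your functions)
    // and submit it to the JobManager, which deploys tasks to the TaskManagers
    env.execute("Socket Window WordCount");
}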
I am using Apache Flink to perform analytics on streaming data.
I am using a dependency whose object takes more than 10 seconds to create, as it reads several files from HDFS during initialization.
If I initialize the object in the open() method, I get a timeout exception; if I do it in the constructor of a sink/flatmap, I get a serialization exception.
Currently I am using a static block to initialize the object in some other class, calling Preconditions.checkNotNull(MGenerator.mGenerator) in the main file, and then it works when used in a flatmap or sink.
Is there a way to create a non-serializable dependency object, which might take more than 10 seconds to initialize, in Flink's flatmap or sink?
public class DependencyWrap {

    static MGenerator mGenerator;

    static {
        final String configStr = "{}";
        final Config config = new Gson().fromJson(configStr, Config.class);
        mGenerator = new MGenerator(config);
    }
}
public class MyStreaming {

    public static void main(String[] args) throws Exception {
        Preconditions.checkNotNull(MGenerator.mGenerator);

        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        ...
        input.flatMap(new RichFlatMapFunction<Tuple2<String, Object>, List<String>>() {
            @Override
            public void open(Configuration parameters) {
            }

            @Override
            public void flatMap(Tuple2<String, Object> value, Collector<List<String>> out) throws Exception {
                out.collect(MGenerator.mGenerator.generateMyResult(value.f0, value.f1));
            }
        });
    }
}
Also, please correct me if I am wrong about the question.
Doing it in the open() method is 100% the right way to do it. Is the timeout exception coming from Flink, or from the object itself?
As a last-ditch method, you could wrap your object in a class that contains both the object and its JSON string or Config (is Config serializable?), with the object marked transient, and then override the readObject/writeObject methods to call the constructor. If the mGenerator object itself is stateless (and you'll have other problems if it's not), the serialization code should get called only once, when the job is distributed to the TaskManagers.
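A rough sketch of that wrapper (the class name MGeneratorWrapper is made up for illustration, and it assumes the JSON string alone is enough to rebuild the object):

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;

public class MGeneratorWrapper implements Serializable {

    private final String configJson;        // small, serializable state
    private transient MGenerator generator; // heavy, rebuilt on each JVM

    public MGeneratorWrapper(String configJson) {
        this.configJson = configJson;
        this.generator = build();
    }

    private MGenerator build() {
        return new MGenerator(new Gson().fromJson(configJson, Config.class));
    }

    // called by Java serialization when the function is deserialized on a TaskManager
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject(); // restores configJson
        this.generator = build(); // rebuild the transient heavy object
    }

    public MGenerator get() {
        return generator;
    }
}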
Using open() is usually the right place to load external lookup sources. The timeout is a bit odd; maybe there is a configuration option for it.
However, if the data is huge, using a static loader (either a static block as you did, or a singleton) has the benefit that you only need to load it once for all parallel instances of the task on the same task manager. Hence, you save memory and CPU time. This is especially true for you, as you use the same data structure in two separate tasks. Furthermore, the static loader can be initialized lazily, on first use, to avoid the timeout in open(); see the sketch below.
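For example, a sketch of such a lazy loader using Java's initialization-on-demand holder idiom (MGenerator and Config are taken from the question):

public class MGeneratorHolder {

    private static class Holder {
        // the JVM runs this once, on first access of Holder.INSTANCE, so the
        // expensive load happens lazily and only once per TaskManager JVM
        static final MGenerator INSTANCE =
            new MGenerator(new Gson().fromJson("{}", Config.class));
    }

    public static MGenerator get() {
        return Holder.INSTANCE;
    }
}

Tasks would then call MGeneratorHolder.get() inside flatMap() instead of reading a field that was set in open().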
The clear downside of this approach is that the testability of your code suffers. There are ways around that, which I could expand on if there is interest.
I don't see the benefit of the proxy serializer pattern: it's unnecessarily complex (custom serialization in Java) for little gain.
I want to show numRecordsIn for an operator in Flink, and for this I have been following the ppt by data Artisans here. The code for the counter is given below:
public static class mapper extends RichMapFunction<String, String> {

    private Counter counter;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.counter = getRuntimeContext()
            .getMetricGroup()
            .counter("numRecordsIn");
    }

    @Override
    public String map(String s) throws Exception {
        counter.inc();
        System.out.println("counter val " + counter.getCount());
        return s;
    }
}
The problem is: how do I specify for which operator I want to show numRecordsIn?
Metric counters are exposed via Flink's metrics system. In order to take a look at them, you have to configure a metrics reporter. A description of how to register a metrics reporter can be found here.
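For example, a minimal flink-conf.yaml snippet registering Flink's built-in SLF4J reporter might look like this (the reporter name "slf4j" and the interval are arbitrary choices):

# periodically write all metrics to the JobManager/TaskManager logs
metrics.reporters: slf4j
metrics.reporter.slf4j.class: org.apache.flink.metrics.slf4j.Slf4jReporter
metrics.reporter.slf4j.interval: 30 SECONDS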
Flink includes a number of built-in metrics, including numRecordsIn. So if that's what you want to measure, there's no need to write any code to implement that particular measurement. Similarly for numRecordsInPerSecond, and a host of others.
The code you asked about causes the numRecordsIn counter to be incremented for the operator in which the metric is being used.
A good way to better understand the metrics system is to bring up a simple streaming job and look at the metrics in Flink's web ui. I also found it really helpful to query the monitoring REST api while a job was running.
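For example, with the web UI on its default port 8081, you can list running jobs and then ask for a specific operator's metrics (the job and vertex IDs below are placeholders):

curl http://localhost:8081/jobs
curl "http://localhost:8081/jobs/<job-id>/vertices/<vertex-id>/metrics?get=numRecordsIn"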
Using custom partitioning in Apache Flink, we specify a key for each record to have it assigned to a particular TaskManager.
Suppose we broadcast a dataset to all of the nodes (TaskManagers). Is there any way, while in a map or flatMap, to get the TaskManager ID?
A custom partitioner does not assign records to a TaskManager but to a specific parallel task instance of the subsequent operator (a TM can execute multiple parallel task instances of the same operator).
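For reference, a minimal sketch with the DataStream API (the stream and its types are invented for illustration): the Partitioner returns the index of the target parallel task instance of the next operator:

DataStream<Long> partitioned = input.partitionCustom(
    new Partitioner<Long>() {
        @Override
        public int partition(Long key, int numPartitions) {
            // pick one of the parallel task instances of the downstream operator
            return (int) (key % numPartitions);
        }
    },
    value -> value); // key selector: use the record itself as the key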
You can access the ID of a parallel task instance by extending a RichFunction, e.g., extending a RichMapFunction instead of implementing a MapFunction. Rich functions are available for all transformations. A RichFunction gives access to the RuntimeContext, which tells you the ID of the parallel task instance:
public static class MyMapper extends RichMapFunction<Long, Long> {

    private int pId;

    @Override
    public void open(Configuration config) {
        // index of this parallel task instance (0-based)
        pId = getRuntimeContext().getIndexOfThisSubtask();
    }

    @Override
    public Long map(Long value) throws Exception {
        // ... use pId here ...
        return value;
    }
}