I want to show numRecordsIn for an operator in Flink, and to do this I have been following a presentation by data Artisans (linked here). The code for the counter is given below:
public static class mapper extends RichMapFunction<String,String>{
    public Counter counter;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.counter = getRuntimeContext()
            .getMetricGroup()
            .counter("numRecordsIn");
    }

    @Override
    public String map(String s) throws Exception {
        counter.inc();
        System.out.println("counter val " + counter.toString());
        return null;
    }
}
The problem is: how do I specify which operator I want to show numRecordsIn for?
Metric counters are exposed via Flink's metric system. To look at them, you have to configure a metric reporter. A description of how to register a metric reporter can be found here.
Flink includes a number of built-in metrics, including numRecordsIn. So if that's what you want to measure, there's no need to write any code to implement that particular measurement. Similarly for numRecordsInPerSecond, and a host of others.
The code you asked about causes the numRecordsIn counter to be incremented for the operator in which the metric is being used.
A good way to better understand the metrics system is to bring up a simple streaming job and look at the metrics in Flink's web UI. I also found it really helpful to query the monitoring REST API while a job was running.
I have a Flink topology that consists of multiple Map and FlatMap transformations. The source and sink read from and write to Kafka. The Kafka records are of type Envelope (defined by someone else) and are not marked as "serializable". I want to unit test this topology.
I defined a simple SourceFunction that returns a list of Envelope as the source:
public class MySource extends RichParallelSourceFunction<Envelope> {
    private List<Envelope> input;

    public MySource(List<Envelope> input) {
        this.input = input;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
    }

    @Override
    public void run(SourceContext<Envelope> ctx) throws Exception {
        for (Envelope listElement : inputOfSubtask) {
            ctx.collect(listElement);
        }
    }

    @Override
    public void cancel() {}
}
I am using MiniClusterWithClientResource to unit test the topology. I ran into two problems:
I need to make MySource serializable, as Flink needs to serialize the source. As a workaround, I made input transient. That allowed the code to compile.
Then I ran into the runtime error:
org.apache.flink.api.common.functions.InvalidTypesException: The return type of function 'Custom Source' could not be determined automatically, due to type erasure. You can give type information hints by using the returns(...) method on the result of the transformation call, or by letting your function implement the 'ResultTypeQueryable' interface.
I am trying to understand why I am getting this error, which I was not getting before, when the topology was consuming from a Kafka cluster using a KafkaConsumer. I found a workaround by providing the type information as follows:
.returns(TypeInformation.of(Envelope.class))
However, during runtime, after deserialization, input is set to null (obviously, as there is no deserialization method defined.).
Questions:
Can someone please help me understand why I am getting the InvalidTypesException exception?
Why is MySource being deserialized/serialized? Is there a way I can avoid this while using MiniClusterWithClientResource?
I could hack some writeObject() and readObject() methods into MySource, but I would prefer to avoid that route. Is it possible to use some framework or class to test the topology without providing a source (and sink) that is serializable? It would be great if I could use something like KeyedOneInputStreamOperatorTestHarness, pass it the topology, and avoid the whole serialization/deserialization step at the beginning.
Any ideas / pointers would be greatly appreciated.
Thank you,
Ahmed.
"why I am getting the InvalidTypesException exception?"
Not sure. Usually I'd need to see the workflow definition to understand where the type information is getting dropped.
"Why is MySource being deserialized/serialized?"
Because Flink distributes operators to multiple tasks on multiple machines by serializing them, sending them over the network, and then deserializing them.
"Is there a way I can avoid this while using MiniClusterWithClientResource?"
Yes. Since the MiniCluster runs in a single JVM, you can use a static ConcurrentLinkedQueue to hold all of the Envelope records, and your MySource just reads from this queue.
Nit: Your MySource should set a transient boolean running flag to true in the open() method, false in the cancel() method, and check it in the run() method's loop.
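The static-queue idea and the running flag from the answer above can be sketched in plain Java, without Flink on the classpath. Everything named here is hypothetical: `TestInput` and `QueueDrainingSource` are illustrative names, `Object` stands in for `Envelope`, and the `Consumer` stands in for `SourceContext.collect`. In a real test the loop would sit inside `run(SourceContext<Envelope> ctx)`.

```java
import java.io.Serializable;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Hypothetical holder: static state lives in the JVM and is never part of
// the serialized source instance.
final class TestInput {
    static final Queue<Object> RECORDS = new ConcurrentLinkedQueue<>();

    private TestInput() {}
}

// Stand-in for MySource: it has no non-serializable fields, so Java
// serialization of the instance succeeds.
final class QueueDrainingSource implements Serializable {
    private static final long serialVersionUID = 1L;

    // volatile because cancel() is typically called from another thread.
    private volatile boolean running = true;

    // Mirrors run(): emit until the queue is empty or cancel() was called.
    void run(Consumer<Object> collector) {
        Object next;
        while (running && (next = TestInput.RECORDS.poll()) != null) {
            collector.accept(next);
        }
    }

    // Mirrors cancel(): stop the loop in run().
    void cancel() {
        running = false;
    }
}
```

The test fills `TestInput.RECORDS` before starting the job. This only works because the MiniCluster runs in the same JVM as the test; with a parallelism above 1, each record is polled by exactly one of the source instances sharing the queue.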
I want to measure the time between an event reaching the system and its processing finishing. I think getting the ingestion time will help, but how do I get it?
You probably want to use latency tracking. Alternatively, you can add the processing time directly after the source in a chained process function (with Context->TimerService#currentProcessingTime()).
Based on the reply from David, we can get the ingest time by chaining a process function directly after the source.
The code below shows how to get the ingest time. In case the same value is also needed as a metric for the difference between ingest time and event time, I have registered a histogram on the metric group for that.
DataStream<EventDataMapping> text = env
    .fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)), "Kafka Source")
    .process(new ProcessFunction<EventDataMapping, EventDataMapping>() {
        private transient DescriptiveStatisticsHistogram eventVsIngestionTimeLag;
        private static final int EVENT_TIME_LAG_WINDOW_SIZE = 10_000;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            eventVsIngestionTimeLag = getRuntimeContext().getMetricGroup().histogram("eventVsIngestionTimeLag",
                new DescriptiveStatisticsHistogram(EVENT_TIME_LAG_WINDOW_SIZE));
        }

        @Override
        public void processElement(EventDataMapping eventDataMapping, Context context, Collector<EventDataMapping> collector) throws Exception {
            long ingestTime = context.timerService().currentProcessingTime();
            LOG.info("process element event time " + context.timestamp() + " current ingestTime " + ingestTime);
            // Lag between ingestion (processing) time and event time, in milliseconds.
            eventVsIngestionTimeLag.update(ingestTime - context.timestamp());
            // Forward the record so that downstream operators receive it.
            collector.collect(eventDataMapping);
        }
    }).returns(EventDataMapping.class);
I am using Apache Flink to perform analytics on streaming data.
I am using a dependency whose object takes more than 10 seconds to create, as it reads several files in HDFS before initialisation.
If I initialise the object in the open method, I get a timeout exception; if I initialise it in the constructor of a sink/flatmap, I get a serialisation exception.
Currently I initialise the object in a static block in another class and call Preconditions.checkNotNull(MGenerator.mGenerator) in the main file. That works when the object is used in a flatmap or sink.
Is there a way to create an object of a non-serializable dependency, which might take more than 10 seconds to initialise, in Flink's flatmap or sink?
public class DependencyWrap {
    static MGenerator mGenerator;
    static {
        final String configStr = "{}";
        // Parse the JSON configuration string.
        final Config config = new Gson().fromJson(configStr, Config.class);
        mGenerator = new MGenerator(config);
    }
}

public class MyStreaming {
    public static void main(String[] args) throws Exception {
        Preconditions.checkNotNull(MGenerator.mGenerator);
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        ...
        input.flatMap(new RichFlatMapFunction<Map<String,Object>,List<String>>() {
            @Override
            public void open(Configuration parameters) {
            }

            @Override
            public void flatMap(Map<String,Object> value, Collector<List<String>> out) throws Exception {
                out.collect(MFVGenerator.mfvGenerator.generateMyResult(value.f0, value.f1));
            }
        });
    }
}
Also, please correct me if I am wrong about the question.
Doing it in the open() method is 100% the right way to do it. Is Flink giving you a timeout exception, or the object?
As a last-ditch method, you could wrap your object in a class that contains both the object and its JSON string or Config (is Config serializable?), with the object marked transient, and then override the readObject/writeObject methods to call the constructor. If the mGenerator object itself is stateless (and you'll have other problems if it's not), the serialization code should get called only once, when jobs are distributed to the task managers.
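A minimal sketch of that wrapper, using only the JDK. `SlowDependency` is a hypothetical stand-in for MGenerator and `DependencyHolder` is an illustrative name; the sketch assumes the dependency can be rebuilt from a configuration string alone. Only `readObject` is customised here, since the default write behaviour already skips the transient field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for MGenerator: deliberately NOT Serializable.
final class SlowDependency {
    final String config;

    SlowDependency(String config) {
        this.config = config;
    }
}

final class DependencyHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String configJson;             // travels with the serialized holder
    private transient SlowDependency dependency; // skipped by Java serialization

    DependencyHolder(String configJson) {
        this.configJson = configJson;
        this.dependency = new SlowDependency(configJson);
    }

    SlowDependency get() {
        return dependency;
    }

    // Invoked by Java serialization on the receiving side; rebuilds the transient field.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.dependency = new SlowDependency(configJson);
    }

    // Demonstration helper: serialize and deserialize the holder in memory.
    static DependencyHolder roundTrip(DependencyHolder holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (DependencyHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Without the `readObject` method, the deserialized copy would come back with `dependency` set to null, which is the behaviour a plain transient field gives.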
Using open is usually the right place to load external lookup sources. The timeout is a bit odd; maybe there is a configuration option for it.
However, if the object is huge, using a static loader (either a static class as you did, or a singleton) has the benefit that you only need to load it once for all parallel instances of the task on the same task manager. Hence, you save memory and CPU time. This is especially true for you, as you use the same data structure in two separate tasks. Further, the static loader can be lazily initialized when it is used for the first time, to avoid the timeout in open.
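One way to get such a lazily initialized static loader in plain Java is the initialization-on-demand holder idiom, sketched below. `ExpensiveLookup` is a hypothetical stand-in for the slow-to-build object and `LookupLoader` is an illustrative name; the construction counter exists only to make the single initialization observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical expensive object, standing in for the real dependency.
final class ExpensiveLookup {
    static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

    ExpensiveLookup() {
        // Real code would read its files here.
        CONSTRUCTIONS.incrementAndGet();
    }

    String resolve(String key) {
        return "resolved:" + key;
    }
}

final class LookupLoader {
    private LookupLoader() {}

    // The JVM initializes Holder only when get() is first called, and does so
    // exactly once even if several threads call get() concurrently.
    private static final class Holder {
        static final ExpensiveLookup INSTANCE = new ExpensiveLookup();
    }

    static ExpensiveLookup get() {
        return Holder.INSTANCE;
    }
}
```

A function would call `LookupLoader.get()` when it processes its first record rather than in open(), so all parallel instances in the same JVM that load the class through the same classloader share one object.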
The clear downside of this approach is that the testability of your code suffers. There are some ways around that, which I could expand if there is interest.
I don't see a benefit of using the proxy serializer pattern. It's unnecessarily complex (custom serialization in Java) and offers little benefit.
We are in the middle of testing the scaling ability of Flink, but we found that scaling does not work, no matter whether we add more slots or increase the number of Task Managers. We would expect linear, or at least close-to-linear, scaling, but the results even show degradation. Any comments would be appreciated.
Test details:
- VMware vSphere
- Just a simple pass-through test:
- auto-generated source, 3 million records, each 1 KB in size, parallelism=1
- the source passes records into the next map operator, which just returns the same record and sends a counter to StatsD; parallelism in the test cases = 2, 4, 6
- 3 TMs, 6 slots in total (2 per TM); each JM/TM has 32 vCPUs and 100 GB memory
Result:
2 slots: 26 seconds, 3mil/26=115k TPS
4 slots: 23 seconds, 3mil/23=130k TPS
6 slots: 22 seconds, 3mil/22=136k TPS
As shown the scaling is almost nothing. Any clue? Thanks.
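For reference, the figures above can be turned into speedup and scaling efficiency with a few lines of Java. `ScalingMath` is a hypothetical helper, not part of Flink, and the 2-slot run is taken as the baseline.

```java
final class ScalingMath {
    private ScalingMath() {}

    // Records per second for a run that processed `records` in `seconds`.
    static double throughput(long records, double seconds) {
        return records / seconds;
    }

    // Measured speedup of a run relative to the baseline run.
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    // Fraction of ideal linear scaling achieved, relative to the baseline slot count.
    static double efficiency(int baselineSlots, double baselineSeconds, int slots, double seconds) {
        double ideal = (double) slots / baselineSlots;
        return speedup(baselineSeconds, seconds) / ideal;
    }
}
```

Going from 2 to 6 slots would ideally give a 3x speedup. `speedup(26, 22)` is about 1.18, so `efficiency(2, 26, 6, 22)` comes out at roughly 0.39, or about 39% of linear scaling.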
You really should be using a RichParallelSourceFunction. If you care about making the records from different instances of the source distinct, you can get ahold of each instance's index from the RuntimeContext, which is available via the getRuntimeContext() method in the RichFunction interface.
Also, Flink has a built-in statsd metrics reporter that you should be using instead of rolling your own. Moreover, numRecordsIn, numRecordsOut, numRecordsInPerSecond, and numRecordsOutPerSecond are already being computed for you, so no need to create this instrumentation yourself. You can also access these metrics via Flink's web interface, or the REST API.
As for why you might be experiencing poor scalability with the Kafka consumer, there are many things that could cause this. If you are using event time processing, then idle partitions could be holding things up (see https://issues.apache.org/jira/browse/FLINK-5479). If the stream is keyed, then data skew could be an issue. If you are connecting to an external database or service, then it could easily be a bottleneck. Misconfigured checkpointing or insufficient network capacity could also cause this.
I would start to debug this by looking at some key metrics in the Flink web UI. Is the load well balanced across the sub-tasks, or is it skewed? You could turn on latency tracking and see if one of the Kafka partitions is misbehaving (by inspecting the latency at the sink(s), which will be reported on a per-partition basis). And you could look for back pressure.
Please refer to the sample code:
public class passthru extends RichMapFunction<String, String> {
    public void open(Configuration configuration) throws Exception {
        ... ...
        stats = new NonBlockingStatsDClient();
    }

    public String map(String value) throws Exception {
        ... ...
        stats.increment();
        return value;
    }
}

public class datagen extends RichSourceFunction<String> {
    ... ...
    public void run(SourceContext<String> ctx) throws Exception {
        int i = 0;
        while (run) {
            String idx = String.format("%09d", i);
            ctx.collect("{\"<a 1kb json content with idx in certain json field>\"}");
            i++;
            if (i == loop)
                run = false;
        }
    }
    ... ...
}

public class Job {
    public static void main(String[] args) throws Exception {
        ... ...
        DataStream<String> stream = env.addSource(new datagen(loop)).rebalance();
        DataStream<String> convert = stream.map(new passthru(statsdUrl));
        env.execute("Flink");
    }
}
The reductionState code:
dataStream.flatMap(xxx).keyBy(new KeySelector<xxx, AggregationKey>() {
    public AggregationKey getKey(rec r) throws Exception {
        ... ...
    }
}).process(new Aggr());

public class Aggr extends ProcessFunction<rec, rec> {
    private ReducingState<rec> store;

    public void open(Configuration parameters) throws Exception {
        store = getRuntimeContext().getReducingState(new ReducingStateDescriptor<>(
            "reduction store", new ReduceFunction<rec>() {
                ... ...
            }
    public void processElement(rec r, Context ctx, Collector<rec> out)
            throws Exception {
        ... ...
        store.add(r);
I want to do a performance analysis of the Flink CEP engine, and I came across these classes:
org.apache.flink.optimizer.costs.CostEstimator;
org.apache.flink.optimizer.costs.Costs;
org.apache.flink.optimizer.costs.DefaultCostEstimator;
The issue is that I don't know how to use any of these classes. Can someone provide me with some code, or a hint, on how to find the cost estimate for operators (a join, for example) in Flink?
Below is the code for a join that I am performing in Flink
DataStream<JoinedEvent> joinedEventDataStream = stream1.join(stream2)
    .where(new KeySelector<RRIntervalStreamEvent, Long>() {
        @Override
        public Long getKey(RRIntervalStreamEvent rrIntervalStreamEvent) throws Exception {
            return rrIntervalStreamEvent.getTime();
        }
    })
    .equalTo(new KeySelector<qrsIntervalStreamEvent, Long>() {
        @Override
        public Long getKey(qrsIntervalStreamEvent qrsIntervalStreamEvent) throws Exception {
            return qrsIntervalStreamEvent.getTime();
        }
    })
    .window(TumblingEventTimeWindows.of(Time.milliseconds(1000)))
    .apply(new JoinFunction<RRIntervalStreamEvent, qrsIntervalStreamEvent, JoinedEvent>() {
        @Override
        public JoinedEvent join(RRIntervalStreamEvent rr, qrsIntervalStreamEvent qrs) throws Exception {
            //getting the cost -- just checking
            // costs.getCpuCost();
            return new JoinedEvent(rr.getTime(), rr.getSensor_id(), qrs.getSensor_id(), rr.getRRInterval(), qrs.getQrsInterval());
        }
    });
How can I compute the cost for this join?
The cost classes belong to the optimizer of the DataSet API (Flink's batch processing API) while the CEP library is built on the DataStream API. The DataStream API does not leverage the DataSet API.
The CEP library and the DataSet optimizer are completely unrelated. Hence, it is not possible to use this code to estimate the cost of a CEP pattern. I'm also not aware of another built-in method to estimate the cost of a CEP pattern (or any other DataStream program).