Is the number of Kafka partitions accessible when using Camel-Kafka? - apache-camel

When using camel-kafka, is there a way to either a) invoke the default Kafka partitioner, and/or b) determine the number of partitions for a topic for use in a partitioning algorithm? In a Camel Processor I currently have
exchange.getIn().setHeader(KafkaConstants.PARTITION_KEY, partition);
but I can't choose a value for partition without knowing how many partitions are available. Can I do this with Camel, or do I need to "go native" and invoke KafkaProducer.partitionsFor?

I found the answer. I can invoke the default partitioner simply by not setting the KafkaConstants.PARTITION_KEY header at all.
I can use a custom implementation of Partitioner by specifying it as a Camel option:
from(...)
    ...
    .to("kafka:{{kafka.host}}:{{kafka.port}}"
        + "?topic={{kafka.topic}}"
        + "&partitioner=my.package.MyPartitioner");
and in MyPartitioner I can get the partition info via the cluster parameter:
@Override
public int partition(String topic,
        Object key, byte[] keyBytes, Object value, byte[] valueBytes,
        Cluster cluster) {
    List<PartitionInfo> partitions = cluster.availablePartitionsForTopic(topic);
    // use partitions.size()...
    return ...
}
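For reference, a complete custom partitioner along these lines might look like the sketch below. The hashing scheme is purely illustrative (roughly what the default partitioner does for keyed records), and the class name is whatever you register via the partitioner endpoint option shown above:

import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

public class MyPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.availablePartitionsForTopic(topic);
        int numPartitions = partitions.size();
        if (keyBytes == null) {
            // No key: fall back to partition 0 in this sketch (the default
            // partitioner would instead pick a round-robin/sticky partition).
            return 0;
        }
        // Illustrative scheme: hash the key bytes onto the available partitions.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {
        // nothing to clean up in this sketch
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // no custom configuration needed in this sketch
    }
}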

Related

Is FLIP-140 still correct in how it describes sorting/spilling data?

FLIP-140 states:
We will introduce a sorting step (with potential spilling, reusing the UnilateralSortMerger implementation) before every keyed operator for sorting/grouping inputs by their keys. This will allow us to process records in per-key groups, which will enable us to use a simplified implementation of a StateBackend that is not organized in key groups and only ever keeps values for a single key.
The single key at a time execution will be used for the Batch style execution as decided by the algorithm described in FLIP-134: DataStream Semantics for Bounded Input.
Moreover it will be possible to disable it through a execution.sorted-shuffles.enabled configuration option.
However, I see no documentation for execution.sorted-shuffles.enabled and no references to it in the code. So is the above description of how things work still correct? I'm wondering how the "only keep one key's state around" approach would work without sorting.
This code makes me think that both the sorting and special state backend are being used with batch execution:
private void setBatchStateBackendAndTimerService(StreamGraph graph) {
    boolean useStateBackend = configuration.get(ExecutionOptions.USE_BATCH_STATE_BACKEND);
    boolean sortInputs = configuration.get(ExecutionOptions.SORT_INPUTS);
    checkState(
            !useStateBackend || sortInputs,
            "Batch state backend requires the sorted inputs to be enabled!");

    if (useStateBackend) {
        LOG.debug("Using BATCH execution state backend and timer service.");
        graph.setStateBackend(new BatchExecutionStateBackend());
        graph.setChangelogStateBackendEnabled(TernaryBoolean.FALSE);
        graph.setCheckpointStorage(new BatchExecutionCheckpointStorage());
        graph.setTimerServiceProvider(
                BatchExecutionInternalTimeServiceManager::create);
    } else {
        graph.setStateBackend(stateBackend);
        graph.setChangelogStateBackendEnabled(changelogStateBackendEnabled);
    }
}
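For experimenting with this behaviour, the two ConfigOptions read in that snippet can be set on the configuration used to build the environment. They appear to correspond to the keys execution.sorted-inputs.enabled and execution.batch-state-backend.enabled rather than the execution.sorted-shuffles.enabled key mentioned in the FLIP, though the sketch below only relies on the constants themselves:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.ExecutionOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
// The same options the generator code above reads. Note the checkState(...)
// guard: the batch state backend requires sorted inputs, so disabling sorting
// also means disabling the single-key batch state backend.
conf.set(ExecutionOptions.SORT_INPUTS, false);
conf.set(ExecutionOptions.USE_BATCH_STATE_BACKEND, false);

StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment(conf);
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

// ... define the rest of the bounded pipeline as usual ...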

Persist Apache Flink window

I'm trying to use Flink to consume bounded data from a message queue in a streaming fashion. The data will be in the following format:
{"id":-1,"name":"Start"}
{"id":1,"name":"Foo 1"}
{"id":2,"name":"Foo 2"}
{"id":3,"name":"Foo 3"}
{"id":4,"name":"Foo 4"}
{"id":5,"name":"Foo 5"}
...
{"id":-2,"name":"End"}
The start and end of a batch can be determined using the event id. I want to receive such batches and store the latest (by overwriting) batch on disk or in memory. I can write a custom window trigger to extract the events using the start and end flags as shown below:
DataStream<Foo> fooDataStream = ...
AllWindowedStream<Foo, GlobalWindow> fooWindow = fooDataStream
        .windowAll(GlobalWindows.create())
        .trigger(new CustomTrigger<>())
        .evictor(new Evictor<Foo, GlobalWindow>() {
            @Override
            public void evictBefore(Iterable<TimestampedValue<Foo>> elements, int size,
                                    GlobalWindow window, EvictorContext evictorContext) {
                for (Iterator<TimestampedValue<Foo>> iterator = elements.iterator();
                     iterator.hasNext(); ) {
                    TimestampedValue<Foo> foo = iterator.next();
                    if (foo.getValue().getId() < 0) {
                        iterator.remove();
                    }
                }
            }

            @Override
            public void evictAfter(Iterable<TimestampedValue<Foo>> elements, int size,
                                   GlobalWindow window, EvictorContext evictorContext) {
            }
        });
But how can I persist the output of the latest window? One way would be to use a ProcessAllWindowFunction to receive all the events and write them to disk manually, but that feels like a hack. I'm also looking into the Table API with a Flink CEP pattern (like this question), but I couldn't find a way to clear the Table after each batch to discard the events from the previous batch.
There are a couple of things getting in the way of what you want:
(1) Flink's window operators produce append streams, rather than update streams. They're not designed to update previously emitted results. CEP also doesn't produce update streams.
(2) Flink's file system abstraction does not support overwriting files. This is because object stores, like S3, don't support this operation very well.
I think your options are:
(1) Rework your job so that it produces an update (changelog) stream. You can do this with toChangelogStream, or by using Table/SQL operations that create update streams, such as GROUP BY (when it's used without a time window). On top of this, you'll need to choose a sink that supports retractions/updates, such as a database.
(2) Stick to producing an append stream and use something like the FileSink to write the results to a series of rolling files. Then do some scripting outside of Flink to get what you want out of this.
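For option (2), a rough sketch wiring the windowed stream from the question into a FileSink might look like this (the output path is a placeholder, records are written as Foo.toString() lines, and FileSink only finalizes part files on checkpoints):

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

// Append each completed batch to rolling files; a script outside Flink can
// then pick up the newest completed file as "the latest batch".
FileSink<Foo> sink = FileSink
        .<Foo>forRowFormat(new Path("/tmp/foo-batches"), new SimpleStringEncoder<Foo>())
        .build();

fooWindow
        .apply(new AllWindowFunction<Foo, Foo, GlobalWindow>() {
            @Override
            public void apply(GlobalWindow window, Iterable<Foo> values, Collector<Foo> out) {
                for (Foo foo : values) {
                    out.collect(foo);
                }
            }
        })
        .sinkTo(sink);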

flink assign uid to window function

Is there a way to assign a uid to a window function (such as apply(ApplyCustomFunction)) as we do for map/flatMap (or other) functions in Flink? The Flink version is 1.13.1.
I would like to illustrate the case with an example:
DataStream<RECORD> outputDataStream = dataStream
        .coGroup(otherDataStream)
        .where(DATA::getKey)
        .equalTo(OTHERDATA::getKey)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
        .apply(new CoGroupFunction());
Thanks
CoGroupedStreams.WithWindow#apply(CoGroupFunction<T1,T2,T>) doesn't have the return type that's needed for setting a UID or per-operator parallelism (among other things). This was done in order to keep binary backwards compatibility, and can't be fixed before Flink 2.0.
You can work around this by using the (deprecated) with method instead of apply, as in
DataStream<RECORD> outputDataStream = dataStream
        .coGroup(otherDataStream)
        .where(DATA::getKey)
        .equalTo(OTHERDATA::getKey)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
        .with(new CoGroupFunction())
        .uid("window");
The with method will be removed once it is no longer needed.
Use with() instead of apply(). It will be fixed in version 2.0, as stated in the documentation.

How does Flink treat timestamps within iterative loops?

How are timestamps treated within an iterative DataStream loop within Flink?
For example, here is a simple iterative loop in Flink where the feedback stream is of a different type to the input stream:
DataStream<MyInput> inputStream = env.addSource(new MyInputSourceFunction());
IterativeStream.ConnectedIterativeStreams<MyInput, MyFeedback> iterativeStream =
        inputStream.iterate().withFeedbackType(MyFeedback.class);

// define an output tag so we can emit feedback objects via a side output
final OutputTag<MyFeedback> outputTag = new OutputTag<MyFeedback>("feedback-output"){};

// now do some processing
SingleOutputStreamOperator<MyOutput> combinedStreams = iterativeStream.process(
        new CoProcessFunction<MyInput, MyFeedback, MyOutput>() {
            @Override
            public void processElement1(MyInput value, Context ctx, Collector<MyOutput> out) throws Exception {
                // do some processing of the stream of MyInput values
                // emit MyOutput values downstream by calling out.collect()
                out.collect(someInstanceOfMyOutput);
            }

            @Override
            public void processElement2(MyFeedback value, Context ctx, Collector<MyOutput> out) throws Exception {
                // do some more processing on the feedback classes
                // emit feedback items
                ctx.output(outputTag, someInstanceOfMyFeedback);
            }
        });

iterativeStream.closeWith(combinedStreams.getSideOutput(outputTag));
My questions revolve around how Flink uses timestamps within a feedback loop:
Within the ConnectedIterativeStreams, how does Flink treat ordering of the input objects across the streams of regular inputs and feedback objects? If I emit an object into the feedback loop, when will it be seen by the head of the loop with respect to the regular stream of input objects?
How does the behaviour change when using event time processing?
AFAICT, Flink doesn't provide any guarantees on the ordering of input objects. I've run into this when trying to use iterations for a clustering algorithm in Flink, where the centroid updates don't get processed in a timely manner. The only solution I found was to essentially create a single (unioned) stream of the incoming events and the centroid updates, versus using a co-stream.
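A rough sketch of that unioned-stream workaround, using placeholder Event, CentroidUpdate and Result types plus a hand-rolled ClusterMsg envelope (none of these names come from a real API):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Common envelope so events and centroid updates travel through ONE stream
// and reach the operator in arrival order, instead of two connected streams.
public class ClusterMsg {
    public Event event;            // set for regular events, null otherwise
    public CentroidUpdate update;  // set for centroid updates, null otherwise
}

DataStream<ClusterMsg> events =
        eventStream.map(e -> { ClusterMsg m = new ClusterMsg(); m.event = e; return m; });
DataStream<ClusterMsg> updates =
        centroidStream.map(u -> { ClusterMsg m = new ClusterMsg(); m.update = u; return m; });
// (add .returns(ClusterMsg.class) to the map calls if type extraction complains)

events.union(updates)
        .process(new ProcessFunction<ClusterMsg, Result>() {
            @Override
            public void processElement(ClusterMsg msg, Context ctx, Collector<Result> out) {
                if (msg.update != null) {
                    // refresh the locally held centroids
                } else {
                    // assign msg.event to the nearest centroid and emit a Result
                }
            }
        });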
FYI, there's this proposal to address some of the shortcomings of iterations.

How to add suffix to final completed file generated by BucketingSink in Apache Flink?

I created some archive data files on HDFS with Apache Flink. The generated file names follow a pattern like part-{parallel-task}-{count}, but I expected them to have a ".gz" suffix so they can be loaded directly by Apache Spark.
I can't find any API to add a suffix to the final completed files generated by BucketingSink in Apache Flink; I can only add a suffix to the in-progress, pending and valid-length states. Can anyone help? (HDFS Connector & Java API)
As far as I can see, there is no option to add a suffix using the default BucketingSink.
One option would be not to use checkpointing and to set the pending suffix to the desired suffix. But since checkpointing is desirable in most cases, this isn't optimal.
My solution was to create a BucketingSinkWithSuffix implementation which is almost an exact copy of the default BucketingSink. The only things that need to be changed are adding a member variable for the suffix, which is set in the constructor, and adjusting the way the part path is created.
Here's my implementation for the constructor:
public BucketingSinkWithSuffix(String basePath, String suffix) {
    this.basePath = basePath;
    this.bucketer = new DateTimeBucketer<>();
    this.writerTemplate = new StringWriter<>();
    this.partSuffix = suffix;
}
And for generating the part path (lines 523 and 528 of the original class):
partPath = new Path(bucketPath, partPrefix + "-" + subtaskIndex + "-" + bucketState.partCounter + partSuffix);
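Usage is then the same as with the stock sink, just with the extra constructor argument (the path and the .gz suffix here are illustrative, and the class keeps BucketingSink's type parameter):

DataStream<String> archiveStream = ...

// Completed part files end up as part-{subtask}-{count}.gz under the base path.
archiveStream.addSink(new BucketingSinkWithSuffix<String>("hdfs:///data/archive", ".gz"));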
