How to aggregate in Camel using the Akka 1.3 integration? - apache-camel

The Akka integration is super nice, but I cannot seem to find any examples on how to aggregate using the Producer actors. My producer is super simple:
class BindingCandidateProducer(config: Configuration)
  extends Actor
  with Producer
  with Oneway
  with Logging
  with Instrumented {

  import BindingCandidateJsonProtocol._

  def endpointUri = "file:data/bindings?fileName=bindings.${date:now:yyyy-MM-dd'T'HHmm}.mjson"

  override protected def receiveBeforeProduce = {
    case bindingCandidate: BindingCandidate => bindingCandidate.toJson.compactPrint
  }
}
NOTE: mjson is the internal name for "multi json", a file format where each line is a complete JSON message.
I'm trying to aggregate multiple BindingCandidate objects into a single file. How and where do I specify my aggregator? Is it a separate actor that lives before this one? There is no information about aggregators in the Akka 1.3 Camel documentation. The Akka 2 documentation has no reference to Camel, although the code is still there. The Akka forum has a single thread about Camel aggregation.
I'm still on Akka 1.3, Scala 2.9, but using Camel 2.12.2.
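For context, what I imagine is a separate, plain Camel route (registered with the CamelContext that the Akka Camel service manages) sitting between the producer and the file endpoint, roughly like the Java DSL sketch below. The intermediate direct: endpoint name, the completion criteria and the newline-joining strategy are just placeholders, not a confirmed solution:
import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.aggregate.AggregationStrategy;

public class BindingCandidateAggregationRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("direct:binding-candidates")                       // placeholder intermediate endpoint
            .aggregate(constant(true), new AggregationStrategy() {
                public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
                    if (oldExchange == null) {
                        return newExchange;                     // first JSON line of the batch
                    }
                    // append each compact JSON line to the "multi json" body
                    String merged = oldExchange.getIn().getBody(String.class)
                            + "\n" + newExchange.getIn().getBody(String.class);
                    oldExchange.getIn().setBody(merged);
                    return oldExchange;
                }
            })
            .completionSize(500)                                // illustrative completion criteria
            .completionTimeout(60000)
            .to("file:data/bindings?fileName=bindings.${date:now:yyyy-MM-dd'T'HHmm}.mjson");
    }
}
The Producer's endpointUri would then point at direct:binding-candidates instead of the file endpoint.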

Related

Flink 1.12.x DataSet --> Flink 1.14.x DataStream

I am trying to migrate from the Flink 1.12.x DataSet API to the Flink 1.14.x DataStream API. mapPartition is not available in the Flink DataStream API.
Our code using the Flink 1.12.x DataSet API:
dataset
    .<few operations>
    .mapPartition(new SomeMapPartitionFn())
    .<few more operations>

public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {

    @Override
    public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
        for (InputModel record : records) {
            /*
             * do some operation
             */
            if (/* some condition based on processing *MULTIPLE* records */) {
                out.collect(...); // Conditional collect ---> (1)
            }
        }
        // At the end of the data, collect
        out.collect(...); // Collect processed data ---> (2)
    }
}
(1) - Collector.collect invoked based on some condition after processing a few records
(2) - Collector.collect invoked at the end of the data
Initially we thought of using flatMap instead of mapPartition, but the Collector is not available in the close() function.
https://issues.apache.org/jira/browse/FLINK-14709 - Only available in case of chained drivers
How to implement this in Flink 1.14.x DataStream? Please advise...
Note: Our application works with only finite set of data (Batch Mode)
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement a similar function, you need to define a window over the stream. Windows discretize streams, which is somewhat similar to mini-batches, but windows offer much more flexibility.
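For illustration, a rough sketch of the windowed variant could look like the following; "input" is the upstream DataStream<InputModel>, and the key extractor, the window size of 1000 and the no-arg OutputModel constructor are assumptions, not part of the original code:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

DataStream<OutputModel> result = input
        .keyBy(record -> record.getPartitionKey())   // assumed key extractor
        .countWindow(1000)                           // process records in chunks of 1000
        .process(new ProcessWindowFunction<InputModel, OutputModel, String, GlobalWindow>() {
            @Override
            public void process(String key, Context context,
                                Iterable<InputModel> records, Collector<OutputModel> out) {
                for (InputModel record : records) {
                    // per-record work, conditional out.collect(...) as in (1)
                }
                out.collect(new OutputModel()); // one result per window, as in (2)
            }
        });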
Solution provided by Zhipeng:
One solution could be to use a stream operator that implements the BoundedOneInput interface. Example code can be found here [1].
[1] https://github.com/apache/flink-ml/blob/56b441d85c3356c0ffedeef9c27969aee5b3ecfc/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java#L75
Flink user mailing list thread: https://lists.apache.org/thread/ktck2y96d0q1odnjjkfks0dmrwh7kb3z
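For completeness, a stripped-down sketch of the BoundedOneInput approach from [1]: the operator buffers all records of its subtask and runs the old mapPartition logic once the bounded input ends. InputModel/OutputModel are the question's types; the in-heap buffer is a simplification (the linked DataStreamUtils version keeps the buffer in operator state so it survives checkpoints):
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.BoundedOneInput;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

public class MapPartitionLikeOperator extends AbstractStreamOperator<OutputModel>
        implements OneInputStreamOperator<InputModel, OutputModel>, BoundedOneInput {

    private final List<InputModel> buffer = new ArrayList<>();

    @Override
    public void processElement(StreamRecord<InputModel> element) {
        buffer.add(element.getValue()); // collect this subtask's records
    }

    @Override
    public void endInput() {
        // called once after the last record in bounded/batch execution
        for (InputModel record : buffer) {
            // per-record work with conditional emits, as in (1)
        }
        // final emit, as in (2); placeholder result object
        output.collect(new StreamRecord<>(new OutputModel()));
    }
}

// Attaching it to the stream (operator name and type info are illustrative):
// DataStream<OutputModel> result = input.transform(
//         "map-partition", TypeInformation.of(OutputModel.class), new MapPartitionLikeOperator());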

How to achieve exactly-once semantics in the Apache Kafka connector

I am using Flink version 1.8.0. My application reads data from Kafka -> transform -> publish to Kafka. To avoid any duplicates during a restart, I want to use the Kafka producer with exactly-once semantics; read about it here:
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/kafka.html#kafka-011-and-newer
My Kafka version is 1.1.
return new FlinkKafkaProducer<String>(topic, new KeyedSerializationSchema<String>() {

    public byte[] serializeKey(String element) {
        // TODO Auto-generated method stub
        return element.getBytes();
    }

    public byte[] serializeValue(String element) {
        // TODO Auto-generated method stub
        return element.getBytes();
    }

    public String getTargetTopic(String element) {
        // TODO Auto-generated method stub
        return topic;
    }
}, prop, opt, FlinkKafkaProducer.Semantic.EXACTLY_ONCE, 1);
Checkpoint code:
CheckpointConfig checkpointConfig = env.getCheckpointConfig();
checkpointConfig.setCheckpointTimeout(15 * 1000);
checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.enableCheckpointing(5000);
If I add exactly-once semantics in the Kafka producer, my Flink consumer is not reading any new data.
Can anyone please share any sample code/application with exactly-once semantics?
Please find the complete code here:
https://github.com/sris2/sample_flink_exactly_once
Thanks
Can anyone please share any sample code/application with exactly-once semantics?
An exactly-once example is hidden in an end-to-end test in Flink. Since it uses some convenience functions, it may be hard to follow without checking out the whole repository.
If I add exactly-once semantics in the Kafka producer, my Flink consumer is not reading any new data.
[...]
Please find the complete code here:
https://github.com/sris2/sample_flink_exactly_once
I checked out your code and found the issue (I had to fix the whole setup/code to actually get it running). The sink actually cannot configure the transactions correctly. As written in the Flink Kafka connector documentation, you need to adjust transaction.timeout.ms either in your Kafka broker (up to 1 hour) or in your application (down to 15 min):
prop.setProperty("transaction.timeout.ms", "900000");
The respective excerpt is:
Kafka brokers by default have transaction.max.timeout.ms set to 15 minutes. This property will not allow to set transaction timeouts for the producers larger than it’s value. FlinkKafkaProducer011 by default sets the transaction.timeout.ms property in producer config to 1 hour, thus transaction.max.timeout.ms should be increased before using the Semantic.EXACTLY_ONCE mode.
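Putting it together, a minimal sketch of the producer setup could look like the following; the broker address, topic name and pool size are placeholder assumptions, and the constructor used mirrors the six-argument one from the question:
import java.util.Optional;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

// inside the job's main method:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // EXACTLY_ONCE transactions are only committed on checkpoints

Properties prop = new Properties();
prop.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address
// keep the producer's transaction timeout at or below the broker's transaction.max.timeout.ms
prop.setProperty("transaction.timeout.ms", "900000");

Optional<FlinkKafkaPartitioner<String>> partitioner = Optional.empty();
FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
        "output-topic",                                     // placeholder topic name
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        prop,
        partitioner,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE,
        5);                                                 // producer pool size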

Apache Flink - kafka producer to sink messages to kafka topics but on different partitions

Right now my Flink code is processing a file and sinking the data to a Kafka topic with 1 partition.
Now I have a topic with 2 partitions and I want the Flink code to sink data to those 2 partitions using the DefaultPartitioner.
Could you help me with that?
Here is a snippet of my current code:
DataStream<String> speStream = inputStream.map(new MapFunction<Row, String>() { .... });
Properties props = Producer.getProducerConfig(propertiesFilePath);
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        props,
        FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));
Solved this by changing the FlinkKafkaProducer to
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,
        new SimpleStringSchema(),
        props));
Earlier I was using
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        props,
        FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));
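If you need to keep the KeyedSerializationSchemaWrapper, another option described in the connector's partitioning documentation is to pass an empty custom partitioner: without one the connector falls back to FlinkFixedPartitioner (each sink subtask writes to a single partition), whereas an empty Optional lets the Kafka producer's own DefaultPartitioner spread the key-less records across both partitions. A sketch only, not tested against this setup:
// java.util.Optional
speStream.addSink(new FlinkKafkaProducer011<String>(
        kafkaTopicName,
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        props,
        Optional.empty())); // no FlinkFixedPartitioner -> Kafka's DefaultPartitioner decides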
In Flink version 1.11 (which I'm using with Java), SimpleStringSchema needs a wrapper (i.e. KeyedSerializationSchemaWrapper), which @Ankit also used but removed from the suggested solution; without it I was getting the constructor-related error below.
FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<String>(
        topic_name,
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        properties,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
Error:
The constructor FlinkKafkaProducer<String>(String, SimpleStringSchema, Properties, FlinkKafkaProducer.Semantic) is undefined

Camel REST (restlet) URL - confusing with path params

I'm using Camel REST (with the restlet component) and I have the following APIs:
rest("/order")
.get("/{orderId}").produces("application/json")
.param().dataType("int").type(RestParamType.path).name("orderId").endParam()
.route()
.bean(OrderService.class, "findOrderById")
.endRest()
.get("/customer").produces("application/json").route()
.bean(OrderService.class, "findOrderByCustomerId")
.endRest()
The problem is that /order/customer doesn't work (see the exception below). The parameters for /customer come from a JWT...
java.lang.String to the required type: java.lang.Long with value customer due Illegal characters: customer
I think that Camel is confusing the .../{orderId} parameter with .../customer.
If I change /customer to /customer/orders it works.
The same idea in Spring Boot could have been done with:
#RequestMapping("/order/{orderId}")
public Order getOrder(#PathVariable Long orderId) {
return orderRepo.findOne(orderId);
}
#RequestMapping("/order/customer")
public List<Order> getOrder() {
return orderRepo.listOrderByCustomer(1l);
}
Any idea about what's happening?
Try changing the order of your GET operations in the Camel Rest DSL (see the reordered sketch after the ticket links below); the restlet component has some issues matching the best possible method.
There are a couple of JIRA tickets related to this:
https://issues.apache.org/jira/browse/CAMEL-12320
https://issues.apache.org/jira/browse/CAMEL-7906
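For illustration, the reordered Rest DSL would look roughly like this: the same two routes from the question, only with the literal /customer path declared before the /{orderId} template so restlet tries it first.
rest("/order")
    .get("/customer").produces("application/json")
        .route()
        .bean(OrderService.class, "findOrderByCustomerId")
        .endRest()
    .get("/{orderId}").produces("application/json")
        .param().dataType("int").type(RestParamType.path).name("orderId").endParam()
        .route()
        .bean(OrderService.class, "findOrderById")
        .endRest();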

Apache Camel splitter with hazelcast seda queue

I'm trying to do a file import process where a file is picked up in a subdirectory of a given folder, the subdirectory identifying the client the file is for; then the records are parsed, split, and sent to Hazelcast SEDA queues. I want to process each record as it's read off the Hazelcast SEDA queue, and have it return a status code (created, updated, or errored) which can be aggregated.
I'm also creating a job record when the file is first picked up and I want to update the job record with the final count of created, updated, and errors.
The JobProcessor below creates this record and sets the client Organization and Job objects in headers on the message. The CensusExcelDataFormat reads an Excel file and creates an Employee object for each line, then returns a Collection.
from("file:" + censusDirectory + "?recursive=true").idempotentConsumer(new SimpleExpression("file:name"), idempotentRepository)
.process(new JobProcessor(organizationService, jobService, Job.JobType.CENSUS))
.unmarshal(censusExcelDataFormat)
.split(body(), new ListAggregationStrategy()).parallelProcessing()
.to(ExchangePattern.InOut, "hazelcast:seda:process-employee-import").end()
.process(new JobCompletionProcessor(jobService))
.end();
from("hazelcast:seda:process-employee-import")
.idempotentConsumer(simple("${body.entityId}"), idempotentRepository)
.bean(employeeImporterJob, "importOrUpdate");
The problem I'm having is that the list aggregation happens immediately and instead of getting a list of statuses I'm getting the same list of Employee objects. I want the Employee objects to be sent on the SEDA queue and the return value from the processing on the queue to be aggregated then run through the JobCompletionProcessor to update the Job record.
The behaviour you are seeing is the default behaviour. The Apache Camel splitter documentation clearly states this in the "What the Splitter returns" section:
Camel 2.2 or older: The Splitter will by default return the last splitted message.
Camel 2.3 and newer: The Splitter will by default return the original input message.
For all versions: You can override this by supplying your own strategy as an AggregationStrategy. There is a sample on this page (Split aggregate request/reply sample). Notice it's the same strategy as the Aggregator supports. This Splitter can be viewed as having a built-in lightweight Aggregator.
So, as you can see, you are required to supply your own aggregation strategy to the splitter. To do this, create a new class that implements AggregationStrategy, something like the code below:
import org.apache.camel.Exchange;
import org.apache.camel.processor.aggregate.AggregationStrategy;

public class MyAggregationStrategy implements AggregationStrategy {

    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        if (oldExchange == null) { // this is null on the first exchange
            // do some work on the first exchange if needed
            return newExchange;
        }
        /*
         * Here you put your code to calculate failed, updated, created.
         */
        return oldExchange;
    }
}
You can then use your custom aggregation strategy by specifying it like the following examples:
.split(body(), new MyAggregationStrategy()) //Java DSL
<split strategyRef="myAggregationStrategy"/> //XML Blueprint
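As a concrete illustration for this question (a sketch only; it assumes each reply from the hazelcast:seda route carries a simple status string such as "CREATED", "UPDATED" or "ERRORED" in the body, and the header names are made up), the strategy could accumulate counts that the JobCompletionProcessor reads afterwards:
import org.apache.camel.Exchange;
import org.apache.camel.processor.aggregate.AggregationStrategy;

public class StatusCountAggregationStrategy implements AggregationStrategy {

    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        Exchange result = (oldExchange == null) ? newExchange : oldExchange;
        // assumed: the reply body is the status string returned by importOrUpdate
        String status = newExchange.getIn().getBody(String.class);
        String headerName = "count" + status; // e.g. countCREATED
        Integer count = result.getIn().getHeader(headerName, 0, Integer.class);
        result.getIn().setHeader(headerName, count + 1);
        return result;
    }
}
It would then be plugged in with .split(body(), new StatusCountAggregationStrategy()), and the JobCompletionProcessor could read the count* headers from the aggregated exchange.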
