how to perform parallel processing of gcp pubsub messages in apache camel - apache-camel

I have this code below that takes message from pubsub source topic -> transform it as per a template -> then publish the transformed message to a target topic.
But to improve performance I need to do this task in parallel.That is i need to poll 500 messages,and then transform it in parallel and then publish them to the target topic.
From the camel gcp component documentation I believe maxMessagesPerPoll and concurrentConsumers parameter will do the job.Due to lack of documentation I am not sure how does it internally works.
I mean a) if I poll say 500 message ,will then it create 500 parallel route that will process the messages and publish it to the target topic b)what about ordering of the messages c) should I be looking at parallel processing EIPs as an alternative
etc.
The concept is not clear to me
Was go
// my route
private void addRouteToContext(final PubSub pubSub) throws Exception {
this.camelContext.addRoutes(new RouteBuilder() {
#Override
public void configure() throws Exception {
errorHandler(deadLetterChannel("google-pubsub:{{gcp_project_id}}:{{pubsub.dead.letter.topic}}")
.useOriginalMessage().onPrepareFailure(new FailureProcessor()));
/*
* from topic
*/
from("google-pubsub:{{gcp_project_id}}:" + pubSub.getFromSubscription() + "?"
+ "maxMessagesPerPoll={{consumer.maxMessagesPerPoll}}&"
+ "concurrentConsumers={{consumer.concurrentConsumers}}").
/*
* transform using the velocity
*/
to("velocity:" + pubSub.getToTemplate() + "?contentCache=true").
/*
* attach header to the transform message
*/
setHeader("Header ", simple("${date:now:yyyyMMdd}")).routeId(pubSub.getRouteId()).
/*
* log the transformed event
*/
log("${body}").
/*
* publish the transformed event to the target topic
*/
to("google-pubsub:{{gcp_project_id}}:" + pubSub.getToTopic());
}
});
}

a) if I poll say 500 message ,will then it create 500 parallel route that will process the messages and publish it to the target topic
No, Camel does not create 500 parallel threads in this case. As you suspect, the number of concurrent consumer threads is set with concurrentConsumers. So if you define 5 concurrentConsumers with a maxMessagesPerPoll of 500, every consumer will fetch up to 500 messages and process them one after the other in a single thread. That is, you have 5 messages processed in parallel.
what about ordering of the messages
As soon as you process messages in parallel, the order of messages is messed up. But this already happens with 1 Consumer when you got processing errors and they are detoured to your deadLetterChannel and reprocessed later.
should I be looking at parallel processing EIPs as an alternative
Only if the concurrentConsumers option is not sufficient.

When you mention the concurrentConsumers option(let's say concurrentConsumers=10), you are asking Camel to create a thread pool of 10 threads, and each of those 10 threads will pick up a different message from the pub-sub queue and process them.
The thing to note here is that when you are specifying the concurrentConsumers option, the thread pool uses a fixed size, which means that a fixed number of active threads are waiting at all times to process incoming messages. So 10 threads(since I specified concurrentConsumers=10) will be waiting to process my messages, even if there aren't 10 messages coming in simultaneously.
Obviously, this is not going to guarantee that the incoming messages will be processed in the same order. If you are looking to have the messages in the same order, you can have a look at the Resequencer EIP to order your messages.
As for your third question, I don't think google-pubsub component allows a parallel processing option. You can make your own using the Threads EIP. This would definitely give more control over your concurrency.
Using Threads, your code would look something like this:
from("google-pubsub:project-id:destinationName?maxMessagesPerPoll=20")
// the 2 parameters are 'pool size' and 'max pool size'
.threads(5, 20)
.to("direct:out");

Related

How to use multithreading inside an JPA camel route

We have an importer running on a powerful, multi-core server. However, our Apache Camel routes are single threaded, which is a shame.
Our [camel] importer is a single-instance program. How can I make a specific route process the messages using multiple threads? The messages are atomic and are processed by a bean, which already does this in a thread-safe way.
I am already happy if I could process batches (maxMessagesPerPoll) in threads and have idle time until the next poll takes place (after all, that's still better than sequential processing).
Here is one of the routes I would like to make multithreaded:
public void onConfigure() throws Exception {
// This is a JPA query which selects all unprocessed modules
String query = RouteQueryHelper.selectNextUnprocessedStaged(ImportAction.IMPORT_MODULES);
from("jpa:com.so.importer.entity.ModuleStageEntity" +
"?consumer.query=" + query +
"&maxMessagesPerPoll=2000" +
"&consumeLockEntity=false" +
"&consumer.delay=1000" +
"&consumeDelete=false")
.transacted().policy("CAMEL_DEFAULT_POLICY")
.bean(moduleImportService) // processes the entity and updates it's status flag
.to("log:import-module?groupInterval=10000")
.routeId("so.route.import-module");
}
The route has consumeDelete=false, because we use a status property on the entity instead (which is modified and saved). The status property is also respected in the consumer.query.
We use camel version 2.17.1 in spring boot (1.3.8.RELEASE) on Java 8.
EDIT 2019-Jan-21: The entities have a method with #Consumed on them, which pushes the entity into the next route after it was processed:
#Consumed
public void gotoNextStatus() {
switch (stageStatus) {
case STAGED: setStageStatus(StageStatus.IMPORTED); break;
case IMPORTED: setStageStatus(StageStatus.RENDERED); break;
case RENDERED: setStageStatus(StageStatus.DONE); break;
}
}
You could introduce some asynchronisation by sending your messages to an intermediate SEDA endpoint:
from("jpa:")
...
.to("seda:intermediateStage")
And then put the real processing inside a new route with N concurrrent SEDA consumers (default is one):
from("seda:intermediateStage?concurrentConsumers=5")
.process(...)

How to schedule JMS consuming in Apache Camel?

I need to consume JMS messages with Camel everyday at 9pm (or from 9pm to 10pm to give it the time to consume all the messages).
I can't see any "scheduler" option for URIs "cMQConnectionFactory:queue:myQueue" while it exists for "file://" or "ftp://" URIs.
If I put a cTimer before it will send an empty message to the queue, not schedule the consumer.
You can use a route policy where you can setup for example a cron expression to tell when the route is started and when its stopped.
http://camel.apache.org/scheduledroutepolicy.html
Other alternatives is to start/stop the route via the Java API or JMX etc and have some other logic that knows when to do that according to the clock.
This is something that has caused me a significant amount of trouble. There are a number of ways of skinning this cat, and none of them are great as far as I can see.
On is to set the route not to start automatically, and use a schedule to start the route and then stop it again after a short time using the controlbus EIP. http://camel.apache.org/controlbus.html
I didn't like this approach because I didn't trust that it would drain the queue completely once and only once per trigger.
Another is to use a pollEnrich to query the queue, but that only seems to pick up one item from the queue, but I wanted to completely drain it (only once).
I wrote a custom bean that uses consumer and producer templates to read all the entries in a queue with a specified time-out.
I found an example on the internet somewhere, but it took me a long time to find, and quickly searching again I can't find it now.
So what I have is:
from("timer:myTimer...")
.beanRef( "myConsumerBean", "pollConsumer" )
from("direct:myProcessingRoute")
.to("whatever");
And a simple pollConsumer method:
public void pollConsumer() throws Exception {
if ( consumerEndpoint == null ) consumerEndpoint = consumer.getCamelContext().getEndpoint( endpointUri );
consumer.start();
producer.start();
while ( true ) {
Exchange exchange = consumer.receive( consumerEndpoint, 1000 );
if ( exchange == null ) break;
producer.send( exchange );
consumer.doneUoW( exchange );
}
producer.stop();
consumer.stop();
}
where the producer is a DefaultProducerTemplate, consumer is a DefaultConsumerTemplate, and these are configured in the bean configuration.
This seems to work for me, but if anyone gives you a better answer I'll be very interested to see how it can be done better.

Apache Camel ProducerTemplate

Hi I am using apache camel + Spring and defined a configure like
public class MyOrderConsumerRouterBuilder extends RouteBuilder implements InitializingBean, ApplicationContextAware{
#Override
public void configure() throws Exception {
from("seda:asyncChannel?concurrentConsumers=20").id("asyncProcessChannelFromId")
.to("bean:OrderProcessManager?method=processOrders").id("asyncProcessChannelToId");
}
}
Is this Producer multithread? I see that consumers are multiple. In my case it is : concurrentConsumers=20
I checked below URL
How do I configure the default maximum cache size for ProducerCache or ProducerTemplate
As per source code DefaultCamelContext.createProducerTemplate() DefaultCamelContext DefaultProducerTemplate is being created with maximumCacheSize (default 1000)
As per this I understand this there can be multiple producers which are being defined using maximumCacheSize as LRU. In my case I have only one endpoint i.e SEDA so there will be only one producer.
So I think there will always be one single threaded producer. Please help me to understand it better.
The producer is not multithreaded, but there are multiple producers.
In your case 20 consumers (threads) are waiting for messages. If a message arrives, it is processed according to the route definition by one of these threads.
If another message arrives the thread who processes the first message is probably still occupied, but one of the other 19 free threads can process the message.
As long as there are no Splitters, Aggregators and similar EIPs, a single thread "walks" the message through your route and in your case finally sends the message to the OrderProcessManager bean. So this producing step (calling the bean method) is obviously done by a single thread for a single message.
BUT since you can have up to 20 threads processing messages in parallel, the OrderProcessManager bean can be called by up to 20 producers (threads) in parallel.

Job Queue using Google PubSub

I want to have a simple task queue. There will be multiple consumers running on different machines, but I only want each task to be consumed once.
If I have multiple subscribers taking messages from a topic using the same subscription ID is there a chance that the message will be read twice?
I've tested something along these lines successfully but I'm concerned that there could be synchronization issues.
client = SubscriberClient.create(SubscriberSettings.defaultBuilder().build());
subName = SubscriptionName.create(projectId, "Queue");
client.createSubscription(subName, topicName, PushConfig.getDefaultInstance(), 0);
Thread subscriber = new Thread() {
public void run() {
while (!interrupted()) {
PullResponse response = subscriberClient.pull(subscriptionName, false, 1);
List<ReceivedMessage> messages = response.getReceivedMessagesList();
mess = messasges.get(0);
client.acknowledge(subscriptionName, ImmutableList.of(mess.getAckId()));
doSomethingWith(mess.getMessage().getData().toStringUtf8());
}
}
};
subscriber.start();
In short, yes there is a chance that some messages will be duplicated: GCP promises at-least-once delivery. Exactly-once-delivery is theoretically impossible in any distributed system. You should design your doSomethingWith code to be idempotent if possible so duplicate messages are not a problem.
You should also only acknowledge a message once you have finished processing it: what would happen if your machine dies after acknowledge but before doSomethingWith returns? your message will be lost! (this fundamental idea is why exactly-once delivery is impossible).
If losing messages is preferable to double processing them, you could add a locking process (write a "processed" token to a consistent database), but this can fail if the write is handled before the message is processed. But at this point you might be able to find a messaging technology that is designed for at-most-once, rather than optimised for reliability.

How can I improve the performance of a Seda queue?

Take this example:
from("seda:data").log("data added to queue")
.setHeader("CamelHttpMethod", constant("POST"))
.setHeader(Exchange.CONTENT_TYPE, constant("application/json"))
.process(new Processor() {
public void process(Exchange exchange) throws Exception {
exchange.setProperty(Exchange.CHARSET_NAME, "UTF-8");
}
})
.recipientList(header(RECIPIENT_LIST))
.ignoreInvalidEndpoints().parallelProcessing();
Assume the RECIPENT_LIST header contains only one http endpoint. For a given http endpoint, messages should be processed in order, but two messages for different end points can be processed in parallel.
Basically, I want to know if there is anything be done to improve performance. For example, would using concurrentConsumers help?
SEDA with concurrentConsumers > 1 would absolutely help with throughput because it would allow multiple threads to run in parallel...but you'll need to implement your own locking mechanism to make sure only a single thread is hitting a given http endpoint at a given time
otherwise, here is an overview of your options: http://camel.apache.org/parallel-processing-and-ordering.html
in short, if you can use JMS, then consider using ActiveMQ message groups as its trivial to use and is designed for exactly this use case (parallel processing, but single threaded by groups of messages, etc).

Resources