Using the threading model, I increased the number of records processed from the DB in a service implemented with Apache Camel.
Now I have to do the same, but this time the records come from a queue. I'm a little vague on the details, but I'm thinking of using the threading model so that I can process more than one record at a time. My questions are:
Earlier, when processing records from the DB, I would query around 10 records and have them processed by the available threads; with a queue, I'm not sure how this is going to work.
What else do I need to consider or know to determine whether the above is doable?
You can use the ActiveMQ/JMS component's "concurrentConsumers" option and set it to the number of concurrent consumers you want on your Camel route.
For example
from("jms:MyQueue?concurrentConsumers=5")
.to("bean:someBean");
or in XML DSL
<route>
<from uri="jms:MyQueue?concurrentConsumers=5"/>
<to uri="bean:someBean"/>
</route>
More details on the ActiveMQ endpoint options can be seen here.
See this also to understand more about competing consumers.
We have developed an Apache Camel, Spring Boot based application to read data from an Oracle table, transform the records, and publish them to a Kafka topic. We are using the Camel SQL component for DB integration and have implemented the split & parallel processing pattern to parallelize processing and achieve high throughput. The Oracle tables are partitioned, so we create multiple routes, one route per table partition, to speed up processing. We have 30 partitions in the table and so created 30 routes.
from(buildSelectSqlEndpoint(routeId))
    .process(new GenericEventProcessor("MessagesReceived"))
    .split(body())
        .parallelProcessing()
        .process(new GenericEventProcessor("SplitAndParallelProcessing"))
        .process(new InventoryProcessor())
        .process(new GenericEventProcessor("ConvertedToAvroFormat"))
        .to(buildKafkaProducerEndPoint(routeId));
I tested the application on my local laptop with an adequate load and it processes as expected. But when we deployed the application to PCF, we see that some of the threads are failing. I enabled the Camel debug log and see the debug log line below:
Pipeline - Message exchange has failed: so breaking out of pipeline
for exchange
Because of this, thousands of messages are not published to Kafka.
From initial analysis, I figured out that Camel creates one thread for each route and, based on the maxMessagesPerPoll configuration, creates a number of threads equal to maxMessagesPerPoll. With split & parallel processing enabled, Camel creates additional threads equal to maxMessagesPerPoll for the Kafka publish. With this approach we will have hundreds of threads, and I'm not sure whether that is an issue with PCF. I also tried reducing the number of routes to 2 to check the message delivery failures; I still see hundreds of failures, and with only 2 routes the total processing time for millions of records increases.
Could you please let me know how to use Apache Camel with containers like PCF? Are there any additional configurations we need in PCF or Camel?
Can I filter messages so only one with a given correlation expression is forwarded?
I have a stream of messages from different devices. I want to keep an SQL table with all devices already encountered.
The trivial way would be to route all messages to an SQL component with an insert statement. But this would create unnecessary load on the DB, because the devices send at a high frequency.
My current solution is a Java predicate that returns true the first time a device ID is encountered since the last restart.
This works, but I would like to see whether I can replace it with Camel's built-in features, potentially making the route easier to understand.
Is there some way to use aggregation to only pass the first message with a given correlation value?
There is the Camel idempotent consumer that does exactly this.
With the help of a repository of already-processed messages, it drops any further message with the same identification characteristics.
This is very handy wherever you have at-least-once semantics on message delivery.
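For illustration, a minimal sketch of such a route; the queue name, the deviceId header, and the SQL statement are placeholders, and the in-memory repository (which only remembers IDs since the last restart, like your predicate) could be swapped for a persistent one such as a JDBC-backed repository:
// Inside a RouteBuilder.configure(); MemoryIdempotentRepository is
// org.apache.camel.processor.idempotent.MemoryIdempotentRepository in Camel 2.x
from("jms:deviceMessages")                                        // hypothetical input queue
    .idempotentConsumer(header("deviceId"),                       // correlation value: the device id
        MemoryIdempotentRepository.memoryIdempotentRepository(10000))
    // only the first message per deviceId gets past this point
    .to("sql:insert into devices (id) values (:#${header.deviceId})");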
We are investigating Camel for use in a new system; in one of our scenarios a set of steps is started, and some of the steps in the set can take hours or days to execute. We need to execute other steps after such long-running steps are finished.
We also need to have the system survive reboot while some of the "routes" with these steps are in progress, i.e., the system might be rebooted, and we need the "routes" in progress to remain in their state and pick up where they left off.
I know we can use a queued messaging system such as JMS, and that messages put into such a queue can be persisted. What isn't clear to me is how (or whether) that fits into Camel: would we need to treat the step(s) after each queue as a route of their own, so that on startup they can read from the queue? That's going to split our processing steps into many more 'routes' than we would have otherwise, but maybe that's the way it's done.
Are there Camel constructs which assist with this kind of system? If I know their names and basic outline, I can probably figure it out, but what I really need is an explanation of what the constructs do.
Camel is not a human workflow / long-running task system; for that kind of thing, look at BPMS systems. Camel is better suited to real-time / near-real-time integrations.
For long tasks you persist their state in some external system such as a message broker, database, or BPMS, and then you can use Camel routes to process and move from one state to the next. That is also where Camel fits in, integrating with the many different systems you can reach out of the box with the 200+ Camel components.
Camel does provide graceful shutdown, so you can safely shut down or reboot Camel. But if you are talking about surviving a system crash, you may want to look at transactions and idempotency for that unlikely event.
You are referring to asynchronous processing of messages in routes. Camel has a couple of components that you can use to achieve this.
SEDA: In-memory, not persistent, and can only connect endpoints within the same Camel context.
VM: In-memory, not persistent, and can connect endpoints in different Camel contexts, but is limited to the same JVM. This component is an extension of SEDA.
JMS: Persistence is configurable on the broker. Much more heavyweight, but also more fault tolerant than SEDA/VM.
SEDA/VM can be used as low-overhead replacements for JMS components, and in some cases I would use them exclusively. In your case persistence is required, so SEDA/VM is not an option; but to keep things simple the examples will use SEDA, as you can get the basics up and running quickly.
The example assumes the following: we have a timer that kicks off, and then there are two processes it needs to run.
In this route the message flow is synchronous between the timer and the two process beans.
If we wanted to make these steps asynchronous, we would need to break each step into a route of its own and then connect these routes using one of the components listed at the beginning.
Notice we would then have three routes, each with only one "processing" step in it. Route Test has only a timer, which fires a message to the SEDA queue called processOne. This message is received from the SEDA queue and sent to the Process_One bean. After this it is sent to the SEDA queue called processTwo, where it is received and passed to the Process_Two bean. All of this is done asynchronously.
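A sketch of those three routes in the Java DSL (the timer period and the bean names are just placeholders matching the description):
// All three routes live in one RouteBuilder.configure()
from("timer:test?period=5000")           // route "Test": only the timer
    .to("seda:processOne");

from("seda:processOne")                  // picked up asynchronously by a SEDA consumer thread
    .to("bean:Process_One")              // assumes a bean registered under this name
    .to("seda:processTwo");

from("seda:processTwo")
    .to("bean:Process_Two");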
You can replace the SEDA components with JMS once you get to understand the concepts. I suspect that state tracking is going to be the most complicated part as Camel makes the asynchronous part easy.
Apache Camel Kafka Consumer provides URI options called "consumerStreams" and "consumersCount".
I need to understand the difference between them, their usage scenarios, and how they fit with consuming messages from a multi-partition Kafka topic.
consumersCount controls how many consumer instances are created by the Camel endpoint. So if you have 3 partitions in your topic and a consumersCount of 3, you can consume 3 messages (one per partition) at a time. This setting does exactly what you would expect from the documentation.
consumerStreams is a totally different setting and has, imho, a misleading name AND misleading documentation.
Currently the documentation (https://github.com/apache/camel/blob/master/components/camel-kafka/src/main/docs/kafka-component.adoc) says:
consumerStreams: Number of concurrent consumers on the consumer
But the source code reveals its real purpose:
consumerStreams configures how many threads are available for all consumers to run on. Internally the Kafka endpoint creates one Runnable per consumer, so consumersCount = 3 means 3 Runnables. These Runnables are executed by a thread pool (ExecutorService) whose size is set by the consumerStreams setting.
Since the individual consumer threads are long-running tasks, the only purpose of consumerStreams can be to handle reconnections or blocked threads. A higher value for consumerStreams does not lead to more parallelization, and it would be better named consumerThreadPoolSize or something similar.
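For illustration, an endpoint for a topic with 3 partitions might look like this (topic, broker, and group names are made up, and the syntax assumes a Camel 2.x release where both options exist):
// hypothetical topic/broker/group names
from("kafka:myTopic?brokers=localhost:9092&groupId=myGroup"
        + "&consumersCount=3"       // 3 KafkaConsumer instances, one per partition
        + "&consumerStreams=10")    // size of the thread pool the consumer Runnables run on
    .to("log:consumed");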
I have checked the Camel Kafka source code, and it seems these parameters have been used differently over time.
consumerStreams was used in old versions of the camel-kafka component, such as 2.13, as you can see here.
consumersCount is used in the latest versions of the camel-kafka component (see here), and it represents the number of org.apache.kafka.clients.consumer.KafkaConsumer instances that will be instantiated, so you should really use this for multi-partition consumption.
It seems they were used together in Camel 2.16.
I'm currently developing a Camel Integration app in which resumption from a previous state of processing is important. When there's a power outage, for instance, it's important that all previously processed messages are not re-processed. The processing should resume from where it left off before the outage.
I've gone through a number of possible solutions, including Terracotta and Apache Shiro. I'm not sure how to use either, as documentation on their integration with Apache Camel is scarce. I haven't settled on either of the two, however.
I'm looking for suggestions on the potential alternatives I can use or a pointer to some tutorial to get me started.
The difficulty in surviving outages lies primarily in state, and what to do with in-flight messages.
Usually, when you're talking state within routes the solution is to flush it to disk, or other nodes in the cluster. Taking the aggregator pattern as an example, aggregated state is persisted in an aggregation repository. The default implementation is in memory, so if the power goes out, all the state is lost. However, there are other implementations, including one for JDBC, and another using Hazelcast (a lightweight in-memory data grid). I haven't used Hazelcast myself, but JDBC does a synchronous write to disk. The aggregator pattern allows you to resume from where you left off. A similar solution exists for idempotent consumption.
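As a rough sketch of the JDBC-backed variant (assuming camel-sql's JdbcAggregationRepository, an already-configured DataSource and Spring PlatformTransactionManager, and a hypothetical MyAggregationStrategy; the endpoint names are made up):
// State is stored in the "my_aggregation" tables, so partially completed
// aggregations survive a crash or reboot
JdbcAggregationRepository repo =
        new JdbcAggregationRepository(txManager, "my_aggregation", dataSource);

from("jms:incoming")                                       // placeholder input queue
    .aggregate(header("correlationId"), new MyAggregationStrategy())  // placeholder strategy
        .aggregationRepository(repo)
        .completionSize(10)
    .to("jms:aggregated");                                 // placeholder output queue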
The second question, around in-flight messages, is a little more complicated and largely depends on where you are consuming from. If you're in the middle of handling a web service request and the power goes out, does it matter that you have lost the message? The user can simply retry. Any effects on external systems can be wrapped in a transaction, or in an idempotent consumer with a JDBC idempotent repository.
If you are building out integrations based on messaging, you should consume within a transaction, so that if your server goes down, the messages go back into the broker and can be replayed to another consumer.
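A sketch of what that looks like with the ActiveMQ component (endpoint and bean names are made up); if the route fails, or the JVM dies before the exchange completes, the message goes back to the broker and is redelivered:
from("activemq:queue:orders?transacted=true")     // consume inside a local JMS transaction
    .to("bean:orderProcessor")                    // hypothetical processing step
    .to("activemq:queue:orders.processed");       // the incoming message is only acknowledged
                                                  // when the whole route succeeds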
Be careful when using seda: or threads blocks; these use in-memory queues to pass exchanges between threads, and any messages flowing through these sorts of routes will be lost if someone trips over the power cable. If you can't afford message loss and need this sort of processing model, consider using a JMS queue as the endpoint between the two routes (with transactions to ensure you pick up where you left off).