Let's say I have a Job Scheduler which has 4 consumers A, B, C and D. Jobs of type X will have to be routed to Consumer A, type Y to Consumer B and so on. Consumers A, B, C and D are to run as independent applications without any dependency, either locally or remotely.
The consumers take varying times to complete their jobs, which are subsequently routed to the Job Scheduler for aggregation.
Clones of one of the consumers may also be needed to share its eligible jobs. A job should however be processed only once.
Is a content-based router the best solution for this? Mind you, I need the custom job scheduler, because only it has the intelligence to split up a job among the consumers.
Or is there a better way to handle this? I don't require broker features such as automatically switching over to another consumer (load balancing) in case of a failure.
I'm not completely sure that I follow you. This sounds like a rather straightforward scenario for asynchronous processing.
I'm not sure how you plan to send these jobs to the Camel application, but given you can receive them somewhere you could probably go ahead with a simple content based router.
Given your requirements for the consumers, I would go for JMS queues (using Apache ActiveMQ or similar broker middleware), one queue per job type. This makes it easy to distribute consumers to different machines without really changing the code.
// Central node routes
from("xxx:routeJob")
    .choice()
        .when(header("type").isEqualTo("x"))
            .to("jms:queue:processJobTypeX")
        .when(header("type").isEqualTo("y"))
            .to("jms:queue:processJobTypeY")
        .otherwise()
            .to("jms:queue:processJobTypeZ");

from("jms:queue:aggregateJob")
    .bean(aggregate);

// Different Camel application (may be duplicated for multiple instances)
from("jms:queue:processJobTypeX")
    .bean(heavyProcessing)
    .to("jms:queue:aggregateJob");

// Yet another Camel application
from("jms:queue:processJobTypeY")
    .bean(lightProcessing)
    .to("jms:queue:aggregateJob");
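To see why duplicating a consumer does not lead to double processing, here is a minimal plain-Java sketch of the same idea without Camel or a broker. It assumes in-memory `BlockingQueue`s stand in for the JMS queues; `JobRoutingSketch`, `Job` and `STOP` are hypothetical names used only for illustration. Each job is taken from the queue by exactly one worker, which is the behaviour you get from several consumers on one JMS queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class JobRoutingSketch {
    // Hypothetical job type: a type label used for routing plus an id.
    static final class Job {
        final String type;
        final int id;
        Job(String type, int id) { this.type = type; this.id = id; }
    }

    // Sentinel that tells a worker to stop (compared by identity).
    static final Job STOP = new Job("stop", -1);

    // Worker loop: take jobs from one queue until the sentinel arrives.
    static void consume(BlockingQueue<Job> queue, List<Integer> done) {
        try {
            for (Job job = queue.take(); job != STOP; job = queue.take()) {
                done.add(job.id); // stands in for the real processing
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Routes jobs by type and returns the sorted ids handled from the X queue,
    // which is shared by `clones` competing workers.
    static List<Integer> run(List<Job> jobs, int clones) {
        BlockingQueue<Job> queueX = new LinkedBlockingQueue<>();
        BlockingQueue<Job> queueY = new LinkedBlockingQueue<>();
        List<Integer> doneX = Collections.synchronizedList(new ArrayList<>());
        List<Integer> doneY = Collections.synchronizedList(new ArrayList<>());

        ExecutorService pool = Executors.newFixedThreadPool(clones + 1);
        for (int i = 0; i < clones; i++) {
            pool.submit(() -> consume(queueX, doneX));
        }
        pool.submit(() -> consume(queueY, doneY));

        try {
            // Content-based router: pick the queue from the job type.
            for (Job job : jobs) {
                ("x".equals(job.type) ? queueX : queueY).put(job);
            }
            // One sentinel per worker so every worker exits.
            for (int i = 0; i < clones; i++) {
                queueX.put(STOP);
            }
            queueY.put(STOP);
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        List<Integer> sorted = new ArrayList<>(doneX);
        Collections.sort(sorted);
        return sorted;
    }
}
```

With three clones on the X queue, every X job still shows up exactly once in the result, whichever worker picked it up.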
Please revisit your question for a better answer :)
Is it possible in Apache Flink to create an application that consists of multiple jobs which together build a pipeline to process some data?
For example, consider a process with an input/preprocessing stage, a business logic and an output stage.
In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.
Is it possible in Flink to build this and directly pipe the output of one job to the input of another (without external components)?
If yes, where can I find documentation about this and can it buffer data if one of the jobs is restarted?
If no, does anyone have experience with such a setup and point me to a possible solution?
Thank you!
If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.
Kafka can do buffering of course (up to some max amount of data), but that's not a solution to a persistent difference in performance, if the upstream job is generating data faster than the downstream job can consume it.
I imagine you could also use files as the 'bridge' between jobs (streaming file sink and then streaming file source), though that would typically create significant latency as the downstream job has to wait for the upstream job to decide to complete a file, before it can be consumed.
An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.
I've seen this done with purpose-built DSLs, PMML models, Javascript (via Rhino), Groovy, Java classloading, ...
You can use a broadcast stream to communicate/update the dynamic portions of the processing.
Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.
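To make the "static topology, dynamic logic" idea concrete, here is a small plain-Java sketch. It is not Flink code and does not use Flink's broadcast state API; it only assumes that control messages (here strings prefixed with `rule:`) arrive interleaved with data and replace the function the operator applies. `DynamicLogicSketch`, `RULES` and the rule names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

class DynamicLogicSketch {
    // Hypothetical catalogue of processing rules that can be switched at runtime.
    static final Map<String, UnaryOperator<String>> RULES = new HashMap<>();
    static {
        RULES.put("upper", s -> s.toUpperCase());
        RULES.put("reverse", s -> new StringBuilder(s).reverse().toString());
    }

    // The "operator": its shape never changes, only the rule it currently holds.
    static List<String> process(List<String> events) {
        UnaryOperator<String> current = UnaryOperator.identity();
        List<String> out = new ArrayList<>();
        for (String event : events) {
            if (event.startsWith("rule:")) {
                // Control message: swap the processing logic, emit nothing.
                current = RULES.getOrDefault(event.substring(5), UnaryOperator.identity());
            } else {
                // Data message: apply whatever rule is active right now.
                out.add(current.apply(event));
            }
        }
        return out;
    }
}
```

In a real Flink job the rule updates would typically arrive on their own stream and be kept in state, so that they survive restarts.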
We are investigating Camel for use in a new system; one of our situations is that a set of steps is started, and some of the steps in the set can take hours or days to execute. We need to execute other steps after such long-elapsed-time steps are finished.
We also need to have the system survive reboot while some of the "routes" with these steps are in progress, i.e., the system might be rebooted, and we need the "routes" in progress to remain in their state and pick up where they left off.
I know we can use a queued messaging system such as JMS, and that messages put into such a queue can be handled to be persisted. What isn't clear to me is how (or whether) that fits into Camel -- would we need to treat the step(s) after each queue as its own route, so that it could, on startup, read from the queue? That's going to split up our processing steps into many more 'routes' than we would have otherwise, but maybe that's the way it's done.
Is/are there Camel construct/constructs which assist with this kind of system? If I know their terms and basic outline, I can probably figure it out, but what I really need is an explanation of what the constructs do.
Camel is not a human workflow / long-running task system. For that kind of work, look at BPMS systems. Camel is a better fit for real-time / near-real-time integrations.
For long tasks, you persist their state in some external system like a message broker, database or BPMS, and then use Camel routes to process and move from one state to the next. Camel also fits in where you need to integrate with many different systems, which you can do out of the box with the 200+ Camel components.
Camel does provide graceful shutdown, so you can safely shut down or reboot Camel. But if you are talking about surviving the unlikely event of a system crash, you may want to look at transactions and idempotency.
You are referring to asynchronous processing of messages in routes. Camel has a couple of components that you can use to achieve this.
SEDA: In memory, not persistent, and can only call endpoints in the same Camel context.
VM: In memory, not persistent, and can call endpoints in different Camel contexts, but limited to the same JVM. This component is an extension of SEDA.
JMS: Persistence is configurable on the queue stack. Much more heavyweight but also more fault tolerant than SEDA/VM.
SEDA/VM can be used as low-overhead replacements for JMS components, and in some cases I would use them exclusively. In your case persistence is required, so SEDA/VM is not an option, but to keep things simple the examples will use SEDA, as you can get some basics up and running quickly.
The example assumes the following: we have a timer that kicks off, and there are two processes it needs to run. See the screenshot below:
In this route the message flow is synchronous between the timer and the two process beans.
If we wanted to make these steps asynchronous we would need to break each step into a route of its own. We would then connect these routes using one of the components listed in the beginning. See the screenshot below:
Notice we have three routes, and each route has only one "processing" step in it. Route Test only has a timer, which fires a message to the SEDA queue called processOne. This message is received on the SEDA queue and sent to the Process_One bean. After this it is sent to the SEDA queue called processTwo, where it is received and passed to the Process_Two bean. All this is done asynchronously.
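Since the screenshots are not much help in text form, here is a rough plain-Java analogue of the three routes. It is not Camel code: it assumes `BlockingQueue`s play the role of the `processOne` and `processTwo` SEDA queues, and `processOne(...)` / `processTwo(...)` are hypothetical stand-ins for the two beans.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class StagedQueuesSketch {
    // Hypothetical stand-ins for the Process_One and Process_Two beans.
    static String processOne(String body) { return body + " -> one"; }
    static String processTwo(String body) { return body + " -> two"; }

    // Sends one message through both stages and returns the final body.
    static String run(String message) {
        BlockingQueue<String> processOneQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> processTwoQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> results = new LinkedBlockingQueue<>();

        // "Route" 2: consume from processOne, call the bean, forward to processTwo.
        Thread stageOne = new Thread(() -> {
            try {
                processTwoQueue.put(processOne(processOneQueue.take()));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // "Route" 3: consume from processTwo and call the second bean.
        Thread stageTwo = new Thread(() -> {
            try {
                results.put(processTwo(processTwoQueue.take()));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        stageOne.setDaemon(true);
        stageTwo.setDaemon(true);
        stageOne.start();
        stageTwo.start();

        try {
            // "Route" 1: the timer fires and drops a message on the first queue.
            processOneQueue.put(message);
            return results.poll(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

The sender returns from `put` immediately and each stage runs on its own thread, which is the asynchronous hand-off the routes above describe. Like SEDA, these queues live in memory, so anything in them is lost on a crash.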
You can replace the SEDA components with JMS once you get to understand the concepts. I suspect that state tracking is going to be the most complicated part as Camel makes the asynchronous part easy.
We are using camel-cxf as consumer (soap) in our project and asked ourselves if camel-cxf uses multiple threads to react on requests.
We think it uses multiple threads, right?!
But what does this mean for the rest of the route? Is all multithreaded after "from" or is there a point of synchronization?
And what does this mean for "parallelProcessing" or "threads"?
In our case we use the jdbc component later in the route. Is camel-jdbc also using multiple threads?
How to know in general what threading model is used by a given component?
Let's start with your last question:
How to know in general what threading model is used by a given component?
You are probably asking which component is single-threaded by default and which ones are multi-threaded?
You need to ask yourself which approach makes most sense for a component and read the component's documentation. Normally the flags will tell you what behavior is applied by default. CXF is a component that requires a web server, jetty in this case, for a SOAP (over HTTP) client to be able to call the service. HTTP is a stateless protocol, and a web server has to scale to many clients, so it makes a lot of sense for a web server to be multi-threaded. So yes, two simultaneous requests to a CXF endpoint are handled by two separate (jetty) threads. The route starting at the CXF endpoint is executed simultaneously by the jetty threads that received the requests.
On the contrary, if you are polling for file system changes, e.g. you want to check if a certain file was created, it makes no sense to apply multiple threads to the task of polling. Thus the file consumer is single threaded. The thread employed by the file consumer to do the polling will also execute your route that processes the file(s) that were picked up during a poll.
If processing the files identified by a poll takes a long time compared to your polling interval, and you cannot afford to miss a poll, then you need to hand off the processing of the rest of the route to another thread so your polling thread is again free to do, well, polling. Enter the Threads DSL.
Then you have processors like the splitter that create many tasks from a single task. To make the splitter work for everyone, it must be assumed that the tasks created by the splitter cannot be performed out of order and/or fully independently of each other. So the safe default is to run the steps wrapped by the split step in the thread that executes the route as a whole. But if you, the route author, know that the individual split items can be processed independently of each other, then you can parallelize the processing of the steps wrapped by the split step by setting parallelProcessing="true".
Both the threads DSL and parallelProcessing="true" acquire threads from a thread pool. Camel creates a pool for you. But if you want to use multiple pools or a pool with a different configuration, then you can always supply your own.
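As a rough picture of what the splitter's two modes mean, here is a plain-Java sketch that does not use Camel at all. It assumes a comma-separated body and a hypothetical `handle(...)` step: the sequential variant runs every item on the calling thread, while the parallel variant submits the items to a thread pool that you supply the size for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SplitSketch {
    // Hypothetical per-item step wrapped by the split.
    static String handle(String item) { return item.toUpperCase(); }

    // Default behaviour: every split item is handled on the calling thread, in order.
    static List<String> splitSequential(String body) {
        List<String> out = new ArrayList<>();
        for (String item : body.split(",")) {
            out.add(handle(item));
        }
        return out;
    }

    // Parallel variant: items are handled by a pool; only safe if they are independent.
    static List<String> splitParallel(String body, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String item : body.split(",")) {
                tasks.add(() -> handle(item));
            }
            List<String> out = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                out.add(future.get());
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The results come back in the original order here only because the futures are collected in submission order; the items themselves may finish in any order.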
I'm currently developing a Camel Integration app in which resumption from a previous state of processing is important. When there's a power outage, for instance, it's important that all previously processed messages are not re-processed. The processing should resume from where it left off before the outage.
I've gone through a number of possible solutions, including Terracotta and Apache Shiro. I'm not sure how to use either, as documentation on their integration with Apache Camel is scarce. I've not settled on either of the two, however.
I'm looking for suggestions on the potential alternatives I can use or a pointer to some tutorial to get me started.
The difficulty in surviving outages lies primarily in state, and what to do with in-flight messages.
Usually, when you're talking state within routes the solution is to flush it to disk, or other nodes in the cluster. Taking the aggregator pattern as an example, aggregated state is persisted in an aggregation repository. The default implementation is in memory, so if the power goes out, all the state is lost. However, there are other implementations, including one for JDBC, and another using Hazelcast (a lightweight in-memory data grid). I haven't used Hazelcast myself, but JDBC does a synchronous write to disk. The aggregator pattern allows you to resume from where you left off. A similar solution exists for idempotent consumption.
The second question, around in-flight messages is a little more complicated, and largely depends on where you are consuming from. If you're in the middle of handling a web service request, and the power goes out, does it matter if you have lost the message? The user can simply retry. Any effects on external systems can be wrapped in a transaction, or an idempotent consumer with JDBC idempotent repository.
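The idempotent consumer idea mentioned above can be sketched in a few lines of plain Java. This is not Camel's implementation: `IdempotentConsumerSketch` is a made-up class, and the `Set` passed in stands in for a durable repository such as a JDBC table. The sketch only shows the core check, which is that a message id that was already recorded is skipped when the message is redelivered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class IdempotentConsumerSketch {
    // Stand-in for a persistent repository; it must survive restarts in practice.
    private final Set<String> seenIds;
    // Stand-in for effects on an external system.
    private final List<String> effects = new ArrayList<>();

    IdempotentConsumerSketch(Set<String> repository) {
        this.seenIds = repository;
    }

    // Returns true if the message was processed, false if it was a duplicate.
    boolean onMessage(String messageId, String body) {
        if (!seenIds.add(messageId)) {
            return false; // id already recorded: skip the redelivered message
        }
        effects.add(body); // the side effect happens once per message id
        return true;
    }

    List<String> effects() {
        return effects;
    }
}
```

In a real system, recording the id and performing the side effect would need to happen in one transaction, otherwise a crash between the two steps can still lose or repeat work.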
If you are building out integrations based on messaging, you should consume within a transaction, so that if your server goes down, the messages go back into the broker and can be replayed to another consumer.
Be careful when using seda: or threads blocks. These use an in-memory queue to pass exchanges between threads, so any messages flowing down these sorts of routes will be lost if someone trips over the power cable. If you can't afford message loss, and need this sort of processing model, consider using a JMS queue as the endpoints between the two routes (with transactions to ensure you pick up where you left off).
I wonder what is the difference between them. Are they describing the same thing?
Is the Google App Engine Task Queue service an implementation of a Message Queue?
I asked a similar question on some Developer Community Groups on Facebook. It was not about GoogleAppEngine specifically - I asked in more of a general sense to determine the use case between RabbitMQ and Celery. Here are the responses I got, which I think are relevant to the topic and fairly clarify the difference between a message queue and a task queue.
I asked:
Will it be appropriate to say that "Celery is a
QueueWrapper/QueueFramework which takes away the complexity of having
to manage the internal queueManagement/queueAdministration activities
etc"?
I understand the book language which says "Celery is a task queue" and
"RabbitMQ is a message broker". However, it seems a little confusing
as a first-time celery user because we have always known RabbitMQ to
be the 'queue'.
Please help in explaining how/what celery does in contrast with
rabbitMQ
A response I got from Abu Ashraf Masnun
Task Queue and Message Queue. RabbitMQ is a "MQ". It receives messages
and delivers messages.
Celery is a Task Queue. It receives tasks with their related data,
runs them and delivers the results.
Let's forget Celery for a moment. Let's talk about RabbitMQ. What
would we usually do? Our Django/Flask app would send a message to a
queue. We will have some workers running which will be waiting for new
messages in certain queues. When a new message arrives, it starts
working and processes the tasks.
Celery manages this entire process beautifully. We no longer need to
learn or worry about the details of AMQP or RabbitMQ. We can use Redis
or even a database (MySQL for example) as a message broker. Celery
allows us to define "Tasks" with our worker codes. When we need to do
something in the background (or even foreground), we can just call
this task (for instant execution) or schedule this task for delayed
processing. Celery would handle the message passing and running the
tasks. It would launch workers which would know how to run your
defined tasks and store the results. So you can later query the task
result or even task progress when needed.
You can use Celery as an alternative for cron job too (though I don't
really like it)!
Another response I got from Juan Francisco Calderon Zumba
My understanding is that celery is just a very high level of
abstraction to implement the producer / consumer of events. It takes
out several painful things you need to do to work for example with
rabbitmq. Celery itself is not the queue. The events queues are stored
in the system of your choice, celery helps you to work with such
events without having to write the producer / consumer from scratch.
Eventually, here is what I took home as my final learning:
Celery is a queue Wrapper/Framework which takes away the complexity of
having to manage the underlying AMQP mechanisms/architecture that come
with operating RabbitMQ directly
GAE's Task Queues are a means for allowing an application to do background processing, and they are not going to serve the same purpose as a Message Queue. They are very different things that serve different functions.
A Message Queue is a mechanism for sharing information, between processes, threads, systems.
An AppEngine task Queue is a way for an AppEngine application to say to itself, I need to do this, but I am going to do it later, outside of the context of a client request.
Might differ depending on the context, but below is my understanding:
Message queue
Message queue is the message broker part - a queue data structure implementation, where you can:
Enqueue/produce/push/send (different terms depending on the platform, but refers to the same thing) message to.
Dequeue/consume/pull/receive message from.
Provides FIFO ordering.
Task queue
Task queue, on the other hand, is to process tasks:
At a desired pace - how many tasks can your system handle at the same time? Perhaps determined by the number of CPU cores on your machine, or if you're on Kubernetes, number of nodes and their size. It's about concurrency control, or the less-cool term, "buffering".
In an async way - non-blocking task processing. Processes tasks in the background, so your main process can go do other stuff after kicking off a task. A server API over HTTP is a popular use case, where you want to respond quickly to the client because an HTTP request usually has a short timeout (<= 30s), especially when your API is triggered by an end user (humans are impatient). If your task takes longer than a few seconds, you want to consider moving it to the background and giving an API response like "OK I received your request, I'll process it when I have time".
Their difference
As you can see, message queue and task queue focus on different aspects, they can overlap, but not necessarily.
An example for task queue but not message queue - if your tasks don't care about ordering - each task does not depend on one another - then you don't need a "queue", FIFO data structure. You can, but you don't have to. You just need a place to store the buffered tasks like a pool, a simple SQL/NoSQL database or even S3 might suffice.
An opposite example is push notification. You use message queue but not necessarily task queue. Server generates events/notifications and wants to deliver them to the client. The server will push notifications in the queue. The client consumes/pulls down notifications from the queue when they are ready to do so. Products like GCP PubSub, AWS SNS can be used for this.
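The two task-queue properties above (a controlled pace and non-blocking submission) can be seen in a small plain-Java sketch. It assumes a fixed-size thread pool plays the role of the task queue's workers; `TaskQueueSketch` and its counters are hypothetical and exist only to make the concurrency limit observable.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class TaskQueueSketch {
    // Runs `tasks` short tasks on `workers` threads.
    // Returns {number of completed tasks, peak number of tasks running at once}.
    static int[] run(int tasks, int workers) {
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < tasks; i++) {
            // submit() returns immediately; the caller is not blocked by the task.
            pool.submit(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(10); // stands in for real work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                running.decrementAndGet();
                completed.incrementAndGet();
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] { completed.get(), peak.get() };
    }
}
```

With 8 tasks and 2 workers, all 8 tasks complete but never more than 2 run at the same time; the rest wait in the pool's internal buffer.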
Takeaway
A task queue is usually more complicated than a message queue because of the concurrency control, not to mention horizontal scaling, like distributing workers across nodes to optimize concurrency.
Tools like Celery are task queue + message queue baked into one. There aren't many tools like Celery, as far as I know, that do both; I guess that's why it's so popular (alternatives are Bull or Bee in NodeJS, or if you know more please let me know!).
My company recently had to implement a task queue. While googling for the proper tool these two terms confused me a lot, because I kind of know what I want, but don't know how people call it and what keyword I should search by.
I personally haven't used AppEngine much so cannot answer that, but you can always check for the points above to see if it satisfies the requirements.
If we only talk about the functionality, it would be hard to discern the difference.
In my company, we tried and failed miserably because we misunderstood the difference between the two.
We created our own worker queue (aka task queue aka scheduler aka cron) and used it for long polling. We scheduled a task 5 seconds into the future (the delay) to trigger the polling code. The code fires a request and checks the response. If the condition isn't met, we create another task to extend the polling; otherwise we stop.
This is DB, network and computationally intensive. Our new use case requires a fast response, so we have to reduce the delay to 0.1, and that is a lot of waste per poll.
So this is a prime example where technologies achieve the same goal but not with the same efficiency.
So the answer is that the main difference lies in the goals that a message queue and a task queue try to achieve.
Good read:
https://stackoverflow.com/a/32804602/3422861
If you think in terms of browser’s JavaScript runtime environment or Nodejs JavaScript runtime environment, the answer is:
The difference between the message queue and the micro-task queue (which is used for things such as Promises) is that the micro-task queue has a higher priority than the message queue, which means that Promise tasks inside the micro-task queue will be executed before the callbacks inside the message queue.
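The ordering can be mimicked with a toy event loop. The sketch below is written in Java to model the rule, and it is not how a browser or Node.js is actually implemented: `EventLoopSketch` and its two deques are made up for illustration. After every task taken from the message queue, the loop drains the whole micro-task queue before taking the next message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class EventLoopSketch {
    final Deque<Runnable> messageQueue = new ArrayDeque<>();   // e.g. timer callbacks
    final Deque<Runnable> microTaskQueue = new ArrayDeque<>(); // e.g. Promise callbacks
    final List<String> log = new ArrayList<>();

    // One message at a time; after each one, drain all pending micro-tasks.
    void runLoop() {
        while (!messageQueue.isEmpty()) {
            messageQueue.poll().run();
            while (!microTaskQueue.isEmpty()) {
                microTaskQueue.poll().run();
            }
        }
    }

    static List<String> demo() {
        EventLoopSketch loop = new EventLoopSketch();
        // The "script" itself runs as the first task.
        loop.messageQueue.add(() -> {
            // Scheduled first, but lands on the lower-priority message queue.
            loop.messageQueue.add(() -> loop.log.add("timeout callback"));
            // Scheduled second, but lands on the micro-task queue.
            loop.microTaskQueue.add(() -> loop.log.add("promise callback"));
            loop.log.add("script end");
        });
        loop.runLoop();
        return loop.log;
    }
}
```

`demo()` logs "script end", then "promise callback", then "timeout callback", even though the timeout callback was scheduled first.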