Any simple way that concurrent consumers can share data in Camel?

We have a typical scenario in which we have to load-balance a set of cloned consumer applications, each running on a different physical server. We should also be able to dynamically add more servers for scalability.
We were thinking of using round-robin load balancing here, but we don't want a long-running job in one server to cause a message to wait in its queue for consumption.
To solve this, we thought of configuring 2 concurrentConsumers for each of the server applications. When an older message is being processed by one thread and a new message arrives, the latter will be consumed from the queue by the second thread. While processing the new message, the second thread has to check a class (global) variable shared by the threads. If it is 'ON', it can assume another thread is active (i.e. a job is already in progress); in that case, it re-routes the message back to its source queue. But if the class variable is 'OFF', it can start the job with the message data.
The jobs themselves are heavyweight, so we want only one job to be processed at a time. That's why the second thread re-routes the message if another thread is active.
So, the question is: 'Any simple way that concurrent consumers can share data in Camel?' Or can we solve this problem in an entirely different way?
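For concreteness, the kind of shared flag described above could be a simple AtomicBoolean inside a singleton processor invoked by both concurrent consumers. This is only a sketch of the asker's own idea; the "requeue" header and runHeavyJob are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;

public class BusyFlagProcessor implements Processor {
    // The shared "class (global) variable" visible to both consumer threads.
    private final AtomicBoolean busy = new AtomicBoolean(false);

    @Override
    public void process(Exchange exchange) throws Exception {
        // Atomically flip OFF -> ON; fails if another thread already runs a job.
        if (busy.compareAndSet(false, true)) {
            try {
                runHeavyJob(exchange.getIn().getBody(String.class));
            } finally {
                busy.set(false); // back to OFF when the job finishes
            }
        } else {
            // Another thread is active: flag the exchange so the route can
            // re-route it back to the source queue (hypothetical header).
            exchange.getIn().setHeader("requeue", true);
        }
    }

    private void runHeavyJob(String data) {
        // heavyweight work omitted
    }
}
```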

For a JMS broker like ActiveMQ you should be able to simply use concurrent listeners on the same queue. It will do round robin, but only among the consumers that are idle, so basically this should just work. You may also have to set the prefetch size to 1, as prefetching might cause a consumer to take messages even while a long-running job blocks them.
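A minimal sketch of that setup in the Camel Java DSL, using the classic activemq-camel component (broker URL, queue name and the heavyJobProcessor bean are assumptions):

```java
import org.apache.activemq.camel.component.ActiveMQComponent;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class PrefetchOneConsumer {
    public static void main(String[] args) throws Exception {
        DefaultCamelContext context = new DefaultCamelContext();

        // queuePrefetch=1: the broker hands each consumer at most one message,
        // so a consumer stuck in a long job cannot hoard waiting messages.
        context.addComponent("activemq", ActiveMQComponent.activeMQComponent(
                "tcp://localhost:61616?jms.prefetchPolicy.queuePrefetch=1"));

        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                from("activemq:queue:jobs?concurrentConsumers=2")
                        .log("processing ${body}")
                        .to("bean:heavyJobProcessor"); // hypothetical job bean
            }
        });
        context.start();
        Thread.sleep(Long.MAX_VALUE); // keep the JVM alive for the consumers
    }
}
```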

Related

Re-distributing messages from busy subscribers

We have the following set up in our project: Two applications are communicating via a GCP Pub/Sub message queue. The first application produces messages that trigger executions (jobs) in the second (i.e. the first is the controller, and the second is the worker). However, the execution time of these jobs can vary drastically. For example, one could take up to 6 hours, and another could finish in less than a minute. Currently, the worker picks up the messages, starts a job for each one, and acknowledges the messages after their jobs are done (which could be after several hours).
Now getting to the problem: The worker application runs on multiple instances, but sometimes we see very uneven message distribution across the different instances. Consider the following graph, for example:
It shows the number of messages processed by each worker instance at any given time. You can see that some are hitting the maximum of 15 (configured via the spring.cloud.gcp.pubsub.subscriber.executor-threads property) while others are idling at 1 or 2. At this point, we also start seeing messages without any started jobs (awaiting execution). We assume that these were pulled by the GCP Pub/Sub client in the busy instances but cannot yet be processed due to a lack of executor threads. The threads are busy because they're processing heavier and more time-consuming jobs.
Finally, the question: Is there any way to do backpressure (i.e. tell GCP Pub/Sub that an instance is busy and have it re-distribute the messages to a different one)? I looked into this article, but as far as I understood, the setMaxOutstandingElementCount method wouldn't help us because it would control how many messages the instance stores in its memory. They would, however, still be "assigned" to this instance/subscriber and would probably not get re-distributed to a different one. Is that correct, or did I misunderstand?
We want to utilize the worker instances optimally and have messages processed as quickly as possible. In theory, we could try to split the more expensive jobs into several smaller messages, thus minimizing the differences in processing time, but is this the only option?
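For reference, here is roughly where the knob in question lives in the plain Java Pub/Sub client (project, subscription and cap values are hypothetical). As noted above, it bounds local buffering; it does not hand already-leased messages back to Pub/Sub for redistribution:

```java
import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class WorkerSubscriber {
    public static void main(String[] args) {
        // Cap how many messages this instance holds unacknowledged at once.
        FlowControlSettings flowControl = FlowControlSettings.newBuilder()
                .setMaxOutstandingElementCount(2L) // hypothetical per-instance cap
                .build();

        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            runJob(message);   // may take minutes or hours
            consumer.ack();    // ack only after the job is done
        };

        Subscriber subscriber = Subscriber.newBuilder(
                        ProjectSubscriptionName.of("my-project", "jobs-sub"), // hypothetical
                        receiver)
                .setFlowControlSettings(flowControl)
                .build();
        subscriber.startAsync().awaitRunning();
    }

    private static void runJob(PubsubMessage message) { /* job execution omitted */ }
}
```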

Individual connections VS shared DB connection in a multi-thread app?

I'm currently supporting a multi-thread app (Delphi 10.4) that spawns about 20 threads to take data from remote punch-clocks and enter it into a DB table (the same one for all threads), but does so by having each thread create its own (MSADO) connection on construction. Now, while this does make the application thread-safe since the threads don't share a resource, would it be a better idea efficiency-wise to have one connection that the threads share, ensuring thread-safety with TMonitor, critical sections or something similar?
It depends on how many writes you make to the database. If you use a single connection but still issue one INSERT statement per thread, you are not solving anything; in addition, your application will slow down due to the synchronization between the threads waiting for their turn on the database connection.
To do this properly, you need to apply something like a producer-consumer pattern, with a queue between the producers (the threads fetching data from the punch-clocks) and the consumer (the DB writer thread).
Once a reader thread fetches the data, it will:
lock the queue access,
add the data to the queue,
unlock the queue.
Once the writer thread starts running, it will:
lock the queue access,
gather all the data from the queue,
remove the messages from the queue,
unlock the queue,
prepare a single INSERT statement for all rows taken from the queue,
execute the transaction on the DB,
sleep for a short period of time (allowing other threads to work).
Note that this is not a data-safe approach: if there is a failure between the data being removed from the queue and committed to the database, that data will be lost.
As you can see, this is a more complex approach, so if you don't experience DB congestion, it's not worth moving to a single connection.
Note: there is also the option of using a pool of DB connections, which is the most common pattern in these cases. With a DB pool, a smaller number of connections is shared among a large number of threads reading from and writing to the database.
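The original app is Delphi, but the shape of the writer-thread loop described above is easiest to show with a small Java sketch (class and method names are hypothetical; the blocking queue stands in for the manually locked queue):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PunchClockWriter implements Runnable {
    // The queue provides the locking described above for free:
    // put() and drainTo() are thread-safe.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public void enqueue(String row) throws InterruptedException {
        queue.put(row); // called by the ~20 reader threads
    }

    @Override
    public void run() {
        List<String> batch = new ArrayList<>();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                batch.clear();
                batch.add(queue.take()); // block until at least one row exists
                queue.drainTo(batch);    // grab everything else that is waiting
                writeBatch(batch);       // one INSERT / transaction for all rows
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void writeBatch(List<String> rows) {
        // build a single multi-row INSERT and commit it here (DB access omitted)
    }
}
```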

Multiple Threads Trying to write to the same NoSQL Datastore at the same time

We have an internal NoSQL datastore at our company. It provides key-value storage, where the value is in JSON format. I will refer to it as the XYZ DataStore.
Problem Statement:
I have an application which spawns 10-15 threads at a time. Each thread is responsible for writing to the same XYZ concurrently, though the records being PUT are different (meaning different keys). The XYZ REST client is a singleton, meaning all the threads use one singleton client (I am using Spring beans to create the singleton client).
Observation: only one thread is able to put records into XYZ. The other threads are not able to write to the datastore at all, not even after a delay.
How can I handle this concurrent writing to XYZ? What is the preferred way?
I can achieve this in one of the following ways:
Implement a lock on the PUT API on my end, so that even if concurrent threads attempt to write, each one waits until the lock is released. I am not sure how to implement this; if anyone has pointers, that would be great.
Treat the above as producer threads and create one consumer thread. The producer threads would write each record to be put into a queue, and the consumer thread would read them one by one and perform the updates. Here I would use a java.util.concurrent.BlockingQueue shared by the producer and consumer threads. Is this the correct way?
Can anyone suggest which is the best way to do it?
BTW, the application is built in Java, using the Spring framework.
TIA
Even with a singleton, as long as you ensure you are not sharing any mutable state across threads and each method call in that class is independent, you can very well keep it as a singleton and not worry about blocking: each thread will simply call the method, that's it. You also know from the method's response whether each transaction failed or succeeded, and based on that you can decide whether the transaction was a success or re-initiate it later. No delay required.
For the queueing approach, you can consider a Deque (the java.util.concurrent package has good implementations); look at some examples and implement it. This is also effective, but here you are sharing the queue across multiple threads. You won't feel any slowness, but sharing is happening. Note that the queue itself gives no confirmation of whether a pushed value was written or not; one thing you can do is remove a value from the queue only after a successful write (otherwise just peek at it), and move values that keep failing to a separate queue for later retries. No delay required.
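A minimal sketch of the single-writer queueing approach, including the retry queue mentioned above (the XyzClient interface and Record type are hypothetical stand-ins for the internal XYZ REST client and its payload):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class XyzWriteQueue {
    private final BlockingQueue<Record> pending = new LinkedBlockingQueue<>();
    private final BlockingQueue<Record> failed = new LinkedBlockingQueue<>();
    private final XyzClient client;

    public XyzWriteQueue(XyzClient client) {
        this.client = client;
        Thread consumer = new Thread(this::drain, "xyz-writer");
        consumer.setDaemon(true);
        consumer.start();
    }

    // Called by the 10-15 producer threads.
    public void submit(Record record) throws InterruptedException {
        pending.put(record);
    }

    private void drain() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Record record = pending.take();           // blocks until a record arrives
                if (!client.put(record.key, record.json)) {
                    failed.put(record);                   // park failures for later retry
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    interface XyzClient { boolean put(String key, String json); } // hypothetical client

    static final class Record {
        final String key, json;
        Record(String key, String json) { this.key = key; this.json = json; }
    }
}
```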

is the camel-cxf consumer multithreaded, and how to check if a component uses multiple threads?

We are using camel-cxf as a consumer (SOAP) in our project and asked ourselves whether camel-cxf uses multiple threads to respond to requests.
We think it uses multiple threads, right?!
But what does this mean for the rest of the route? Is everything multithreaded after the "from", or is there a point of synchronization?
And what does this mean for "parallelProcessing" or "threads"?
In our case we use the jdbc component later in the route. Is camel-jdbc also using multiple threads?
How to know in general what threading model is used by a given component?
Let's start with your last question:
How to know in general what threading model is used by a given component?
You are probably asking which components are single-threaded by default and which ones are multi-threaded?
You need to ask yourself which approach makes most sense for a component, and read the component's documentation. Normally the flags will tell you what behavior is applied by default. CXF is a component that requires a web server, Jetty in this case, for a SOAP (over HTTP) client to be able to call the service. HTTP is a stateless protocol and a web server has to scale to many clients, so it makes a lot of sense for a web server to be multi-threaded. So yes, two simultaneous requests to a CXF endpoint are handled by two separate (Jetty) threads, and the route starting at the CXF endpoint is executed simultaneously by the Jetty threads that received the requests.
On the contrary, if you are polling for file system changes, e.g. you want to check if a certain file was created, it makes no sense to apply multiple threads to the task of polling. Thus the file consumer is single-threaded. The thread employed by the file consumer to do the polling will also execute your route that processes the file(s) that were picked up during a poll.
If processing the files identified by a poll takes a long time compared to your polling interval, and you cannot afford to miss a poll, then you need to hand off the processing of the rest of the route to another thread so your polling thread is again free to do, well, polling. Enter the Threads DSL.
Then you have processors like the splitter that create many tasks from a single task. To make the splitter work for everyone, it must be assumed that the tasks created by the splitter cannot be performed out of order and/or fully independently of each other. So the safe default is to run the steps wrapped by the split step in the thread that executes the route as a whole. But if you, the route author, know that the individual split items can be processed independently of each other, then you can parallelize the processing of the steps wrapped by the split step by setting parallelProcessing="true".
Both the Threads DSL and parallelProcessing="true" acquire threads from a thread pool. Camel creates a pool for you, but if you want to use multiple pools or a pool with a different configuration, you can always supply your own.
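To make the two constructs concrete, here is a minimal sketch in the Camel Java DSL (the endpoints are hypothetical):

```java
import org.apache.camel.builder.RouteBuilder;

public class ThreadingSketchRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file:inbox?delete=true")     // the file consumer polls single-threaded
            .threads(5)                    // Threads DSL: hand the rest of the route
                                           // to a pool so the poller is free again
            .split(body().tokenize("\n"))
                .parallelProcessing()      // opt in: split items are independent
                .to("jdbc:myDataSource")   // jdbc work runs on whichever thread calls it
            .end();
    }
}
```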

Message Queue vs Task Queue difference

I wonder what the difference is between them. Are they describing the same thing?
Is the Google App Engine Task Queue service an implementation of a Message Queue?
I asked a similar question in some developer community groups on Facebook. It was not about Google App Engine specifically; I asked in a more general sense, to determine the use case between RabbitMQ and Celery. Here are the responses I got, which I think are relevant to this topic and fairly clarify the difference between a message queue and a task queue.
I asked:
Will it be appropriate to say that "Celery is a QueueWrapper/QueueFramework which takes away the complexity of having to manage the internal queueManagement/queueAdministration activities etc."?
I understand the book language which says "Celery is a task queue" and "RabbitMQ is a message broker". However, it seems a little confusing to a first-time Celery user, because we have always known RabbitMQ to be the "queue".
Please help in explaining how/what Celery does in contrast with RabbitMQ.
A response I got from Abu Ashraf Masnun:
Task Queue and Message Queue. RabbitMQ is an "MQ". It receives messages and delivers messages.
Celery is a Task Queue. It receives tasks with their related data, runs them and delivers the results.
Let's forget Celery for a moment. Let's talk about RabbitMQ. What would we usually do? Our Django/Flask app would send a message to a queue. We would have some workers running, waiting for new messages in certain queues. When a new message arrives, a worker starts working and processes the task.
Celery manages this entire process beautifully. We no longer need to learn or worry about the details of AMQP or RabbitMQ. We can use Redis or even a database (MySQL, for example) as a message broker. Celery allows us to define "Tasks" with our worker code. When we need to do something in the background (or even the foreground), we can just call this task (for instant execution) or schedule it for delayed processing. Celery handles the message passing and running the tasks. It launches workers which know how to run your defined tasks and store the results, so you can later query the task result, or even the task progress, when needed.
You can use Celery as an alternative for cron jobs too (though I don't really like that)!
Another response I got from Juan Francisco Calderon Zumba:
My understanding is that Celery is just a very high-level abstraction to implement the producer/consumer pattern for events. It takes out several painful things you need to do to work with, for example, RabbitMQ. Celery itself is not the queue: the event queues are stored in the system of your choice; Celery helps you to work with such events without having to write the producer/consumer from scratch.
In the end, here is what I took home as my final learning:
Celery is a queue wrapper/framework which takes away the complexity of having to manage the underlying AMQP mechanisms/architecture that come with operating RabbitMQ directly.
GAE's Task Queues are a means for allowing an application to do background processing, and they are not going to serve the same purpose as a Message Queue. They are very different things that serve different functions.
A Message Queue is a mechanism for sharing information, between processes, threads, systems.
An AppEngine task Queue is a way for an AppEngine application to say to itself, I need to do this, but I am going to do it later, outside of the context of a client request.
Might differ depending on the context, but below is my understanding:
Message queue
Message queue is the message broker part - a queue data structure implementation, where you can:
Enqueue/produce/push/send (different terms depending on the platform, but they refer to the same thing) messages to it.
Dequeue/consume/pull/receive messages from it.
It provides FIFO ordering.
Task queue
Task queue, on the other hand, is for processing tasks:
At a desired pace - how many tasks can your system handle at the same time? Perhaps determined by the number of CPU cores on your machine or, if you're on Kubernetes, the number of nodes and their size. It's about concurrency control, or the less-cool term, "buffering".
In an async way - non-blocking task processing. Tasks are processed in the background, so your main process can go do other stuff after kicking off a task. A server API over HTTP is a popular use case, where you want to respond quickly to the client because HTTP requests usually have a short timeout (<= 30s), especially when your API is triggered by an end user (humans are impatient). If your task takes longer than a few seconds, you want to consider moving it to the background and giving an API response like "OK, I received your request, I'll process it when I have time".
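A minimal, platform-neutral sketch of those two properties, using a plain Java thread pool as a stand-in for a task queue (the task itself is hypothetical):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TaskQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        // Concurrency control: at most 4 tasks run at once; the rest wait
        // ("buffering") in the executor's internal queue.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Async: submit() returns immediately, so the caller can answer the
        // client right away while the task runs in the background.
        pool.submit(() -> System.out.println("processing a long-running job"));
        System.out.println("OK, I received your request, I'll process it when I have time");

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```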
Their difference
As you can see, message queue and task queue focus on different aspects, they can overlap, but not necessarily.
An example of a task queue that is not a message queue: if your tasks don't care about ordering - no task depends on another - then you don't need a "queue", a FIFO data structure. You can use one, but you don't have to. You just need a place to store the buffered tasks, like a pool; a simple SQL/NoSQL database or even S3 might suffice.
An opposite example is push notifications: you use a message queue but not necessarily a task queue. The server generates events/notifications and wants to deliver them to the client. The server pushes notifications into the queue, and the client consumes/pulls down notifications from the queue when it is ready to do so. Products like GCP Pub/Sub and AWS SNS can be used for this.
Takeaway
A task queue is usually more complicated than a message queue because of the concurrency control, not to mention if you want horizontal scaling, like distributing workers across nodes to optimize concurrency.
Tools like Celery are a task queue and a message queue baked into one. As far as I know there aren't many tools like Celery that do both; I guess that's why it's so popular (alternatives are Bull or Bee in NodeJS, or if you know more, please let me know!).
My company recently had to implement a task queue. While googling for the proper tool, these two terms confused me a lot, because I kind of knew what I wanted but didn't know what people call it or what keyword I should search by.
I personally haven't used App Engine much, so I cannot answer that part, but you can always check the points above to see if it satisfies the requirements.
If we only talk about functionality, it would be hard to discern the difference.
At my company, we tried and failed miserably due to our misunderstanding of the two.
We created our worker queue (aka task queue, aka scheduler, aka cron) and we used it for long polling. We set the task schedule 5 seconds into the future (a delay) to trigger the polling code. The code fires a request and checks the response; if the condition isn't met, we create another task to extend the polling, and otherwise we stop.
This is DB-, network- and computationally intensive. Our new use case requires a fast response, so we had to reduce the delay to 0.1 seconds, and that is a lot of waste per poll.
So this is a prime example where the technologies achieve the same goal, but not with the same proficiency.
So the answer is: the main difference is in the goals a Message Queue and a Task Queue try to achieve.
Good read:
https://stackoverflow.com/a/32804602/3422861
If you think in terms of a browser's JavaScript runtime environment or the Node.js JavaScript runtime environment, the answer is:
The difference between the message queue and the micro-task queue (such as the one Promises use) is that the micro-task queue has a higher priority than the message queue, which means that a Promise task inside the micro-task queue will be executed before the callbacks inside the message queue.
