The apache camel eip framework has a component supporting reactive streams.
From what i can say from the documentation, the reactive-streams component only works in a single JVM, routing reactive streams from/to camel.
What would be the an appropriate mechanism for having a camel producer in one application, and a camel consumer in another application which produce and consume reactive streams over the network?
I assume some kind of middleware is needed, which one would be suited for this scenario?
RSocket would be a good choice http://rsocket.io/
It extends the reactive stream model over the network.
The Java implementation https://github.com/rsocket/rsocket-java uses Project Reactor internally, so RX operators can work automatically and you get back pressure etc.
It's well supported and recently adopted by Spring Framework.
Related
Being quite familiar with Apache Camel, I am a new bee in Kafka Streams. I am learning Kafka streams, but could not find any relevant answer for the below query,
Being a library both Camel and Kafka Streams can create pipelines to extract data, polishing/transforming and load into some sink using a processor. Camel also supports stream processing. I want to understand the
difference between these two since I feel Camel library to be more generic than Kafka Stream which is not relevant for systems where there is no Kafka broker (no sure if this is wrong)
which library is recommended for which type of use case
Thanks in advance.
Kafka Streams is a stream processing framework, that consumes messages from Kafka topics and writes them back to other Kafka topics. It brings support for stateful transformations such as aggregations to tables and similar, leveraging RocksDB, when necessary. You can provide Rest endpoints to such tables/stores, but that is already extending the Kafka Streams features.
Another possible extension is, to send messages somewhere else than Kafka. You will have to provide the client to do so yourself. With that regards, Kafka Streams' scope is much less versatile than Apache Camel. Because of that specialisation, it supports various Kafka specific features, such as parallel processing based on Kafka consumer groups, predefined message envelopes and exactly once semantics. One of the most important feature is the support of "stream time" in Kafka streams, which allows reprocessing of messages by their Kafka timestamps regardless of the Wall-Clock-Time.
You can have a look on KSQL, which is build on top of Kafka Streams, to get an idea, what is possible to build with Kafka Streams.
In short, if you have data in Kafka, that you want to process and write back to Kafka for other programs to consume, Kafka Streams is a very helpful framework. It even has a similar deployment model as Apache Camel. However, if you need to integrate different technologies with Kafka, you need to stay with Apache Camel. Note, there is Kafka Connect in the Apache Kafka family, that is geared towards the integration of data from other systems with Apache Kafka.
So, we have a use case in our production systems where we could probably use Akka streams. To understand how Akka streams exactly provide back pressure, I would like to go a bit deeper into our requirements.
We have a Solr cluster that hosts some of our data. Next, we have a Play app that serves the front-end customer facing site. Every in-coming request ultimately boils down to fetching a good deal of data from Solr using the /sql handler that Solr provides. Once we fetch the entire dataset from Solr, we write it back after morphing it, to a Cassandra cluster. This can be converted into a problem which can be solved using Akka streams where the Solr stream from the /sql handler will be the akka Source and the Cassandra storage will be the Sink and everything in between will be custom Flows.
I was studying Akka streams and understand it's an implementation of the Reactive streams. Most notably, the way Akka streams provide back pressure to make sure the customer isn't overwhelmed by the producer. Now, with respect to my use case, I want to understand how Akka provides back pressure.
As I can see it, there's a reactive streams library for Cassandra. Since it's the consumer in our case, this driver will be capable of signalling to the producer about how much data it will be able to receive. That would mean, there has to be a corresponding driver on the producer side that can react to this signal and control the emitting of elements. Specifically, since the producer in our case is Solr, isn't it correct that I would also have to use a reactive-compliant Solr driver that I can use to fetch documents from Solr and stream it in my application? This driver would then be capable of controlling the rate at which it has to fetch the documents from the Solr cluster whenever the Cassandra reactive driver signals it to backpressure. Isn't this correct?
If that is indeed the case, will using Akka streams without a non-reactive driver on the producer side provide any benefits? Specifically, are there other ways that Akka publishers can provide back pressure capabilities in such cases when the driver isn't reactive-compliant?
For Solr, there's also a fully reactive Akka Streams implementation from the Alpakka project, so using that as the Source would handle backpressure, though it would mean not using the SQL interface for expressing the query.
On the other hand, since the Solr SQL interface is essentially a JDBC facade which uses Solr, it's possible to use the Alpakka Slick integration as long as you define an instance of slick.jdbc.JdbcProfile which uses the Solr JDBC driver.
While exploring Akka streams, I also came across Apache Flink which stream processing engine.
Akka streams implements reactive streams and supports back pressure.
So if I have to make decision between two, which one should I go for? How do they differ and whats the similarity? What should be the criteria here?
Akka Streams is a library implementing reactive streams specification.
Apache Flink is a streaming engine.
The main high level difference is that in Apache Flink you create a job by coding against one of Flink APIs and you submit that job to Apache Flink cluster. It is the Apache Flink cluster that executes your stream processing job. By using Akka Streams you are creating a standalone application. In that sense Akka Streams is a more lightweight of the two.
You can still distribute Akka Streams based app by using StreamRefs, though you need to do that explicitly in the code and you need to run Akka Cluster. Apache Flink already manages a cluster so you don't need to do that explicitly in your code (though you still need the cluster set up and running to submit your jobs to). Apache Flink has smarts built in to take a job and execute it in an optimal way. Parallelizing/distributing execution when possible. You don't get that with Akka Streams.
Apache Flink stream processing is designed to achieve end2end exactly once processing semantics in face of failures. In Akka Streams such guarantee would need to be implemented explicitly in your code.
Akka Streams as reactive streams specification implementation is all about asynchronous and memory bound processing. Akka HTTP for example is built on top of Akka Streams and as a result implements a very efficient and lightweight client and server sides of HTTP protocol.
Akka Streams implements asynchronous non-blocking backpressure (as per reactive streams specification) to guarantee the memory boundedness during execution. Apache Flink also has a backpressure mechanism, though it's not implemented in the same way.
Akka Streams as an implementation of reactive streams specification can interoperate with other implementations like RxJava or Project Reactor. Apache Flink is not part of any broader standard.
I would say the main reasons to go for Apache Flink is the exactly once guarantees and automated distribution that comes with it. Otherwise Akka Streams is a very powerful API with simpler execution model.
EDIT:
Probably worth mentioning project Alpakka that brings a lot of technologies to Akka Streams so that they can be plugged in to reactive streams based processing.
I am not an expert in Akka Streams, but as far as I know, the main difference is that Flink offers the distribution of processing out of the box, while Akka Streams does not, since it was designed to process data on a single node.
The similarity between the two is that they both offer stream processing capabilities and in this sense, they probably have similar functionality.
But, Flink has multiple additional modules like SQL, CEP, or Machine Learning that You won't be able to get in Akka Streams. Also, Flink provides fail-safety and state recovery, which I am not sure if is present in Akka Streams out of the box.
On the other hand, setting up Akka Streaming will require less work as You don't need to care about setting JobManager & TaskManager but You can simply create a Java/Scala application, dockerize & run it somewhere.
So, the main question You should ask Yourself is, if the data You are processing is big enough that it will need to be processed on multiple nodes if it is then You really have no choice other than Flink (just in scenario Akka Streams vs. Flink). If however, the data You are going to process can be processed on a single node, then You should assess the fail-safety & message delivery guarantees You need. In the general case scenario, using Akka Streams may be easier to start with, but Flink may take over when it comes to productionizing the app.
I'm using akka + camel to consume message from activemq, and I'm trying to figure out how to deploy this consumer in multiple machines without duplicate the message. In this case I'm consuming message from a topic and the activemq should know I have one akka system in various machines, instead of various single independent systems.
I tried to accomplish that using akka cluster, but that example using a frontend that subscribe to a cluster of backend does not help since my "backend" actor is the activemq consumer itself and I can't tell activemq to subscribe to my cluster.
Any ideas?
JMS versions < 2.0 does not allow multiple nodes to share a topic subscription (not duplicating the message to each consumer). To cope with that ActiveMQ provides Virtual Topic (you can consume messages published to a topic from a Queue which allows for multiple consumers - load balancing).
It's all naming conventions. So you simply publish to the topic VirtualTopic.Orders and then consume from the queue Consumer.ClusterX.VirtualTopic.Orders. Naming conventions could be changed - see docs.
http://activemq.apache.org/virtual-destinations.html
I am considering which technology to use for the following use case:
the system is event driven
there is a flow (which is mostly without forks except for error handling)
the flow itself should be sync , but non-blocking
the possibilities I have so far are:
pure java -
this make the code not so clear as I have to nest the callbacks inside one another
and have to write everything myself
apache-camel - use the camel routes
i.e.:
from(URI)
.transform(creatUserExpression) //prepare msg to send to db
.inOut(DB.URI) //send to db
.transform(UserCreatedExpression) //prepare msg to send to next step
.inout(OtherService.URI)
.end();
this looks like a nice solution, but is camel suited to handling all my business logic - all the flows for the events?
camel is mostly used for integration between services, so we are not sure if it would be correct to use it for the business logic
java RX - looks like a possible option , still don't know it enough, and is it production ready?
current release is 0.20.7 - not yet a 1.X version
akka - tried to use it for the flow - but in order to make sure the flows goes only 1 way we needed to use FSM which caused the code to be too complicated and we decided against it
any other suggestions will be appreciated
I agree that most of the time you typically try and stay clear from as much business logic in Integration routes. Business logic on a ESB is typically a big no, some of the more rigid architects I have known will break into violent swearing when they see business logic in the integration layer. This point of view makes sense when you are using a ESB system to integrate services.
In a SOA/Services world you don't want the producers and consumers to be tightly coupled and adding business logic to the integration layer breaks that abstraction. A consumer should be able to consume data from a SAP system, C# web service, Java service or any other service without knowing how the producer work. It should just understand the data.
Apache Camel is not a ESB, it is a EIP toolkit/framework. You can use Apache camel in a client application as well. This is the one of the reasons I really have a soft spot for Camel. It is library I can use to create integration routes. It is flexible and can be used by itself without the need for a full scale server.
So in your case I don't see a problem using apache camel for this purpose. If you are going to install ServiceMix, FuseESB or another full ESB system you are just going to overly complicate the whole setup.
My suggestion(it is just a suggestion) is that in this case having business logic in your route is not going to be bad as this is not really(from your description) about integration but leveraging the power of Camel to create and maintain a event system. Remember Camel does not come with a runtime environment so you still need to host this route somewhere. A simple run time container would be Apache Karaf. You can use this OSGi kernel to install and run your routes on. Last time I check the Karaf project was like under 40MB unzipped so compared to some of the other run times it is really small.
I have used Camel in this fashion to create and host services for android client for example. I guess my main message is that Camel can be considered as a routing engine or routing engine builder which specialises in integration. Camel is not an ESB so the concerns about business logic in here is not always applicable.