Camel route - Filter all but first message - apache-camel

Can I filter messages so only one with a given correlation expression is forwarded?
I have a stream of messages from different devices. I want to keep an SQL table with all devices already encountered.
The trivial way would be to route all messages to an SQL component with an insert statement, but this would create unnecessary load on the DB because devices send at a high frequency.
My current solution is to have a java predicate that returns true the first time the device id is encountered since last restart.
This works, but I would like to see if I can replace it with Camel's on-board features, potentially making the route easier to understand.
Is there some way to use aggregation to only pass the first message with a given correlation value?

There is the Camel idempotent consumer that does exactly this.
With the help of a repository of already processed messages it drops any further message with the same identification characteristics.
This is very handy wherever you have at-least-once semantics on message delivery.
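For illustration, here is a minimal sketch of such a route in the Java DSL; the endpoint URIs, the deviceId header and the SQL statement are assumptions based on the question (package names as in Camel 3.x):

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.support.processor.idempotent.MemoryIdempotentRepository;

public class DeviceRegistrationRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("direct:deviceEvents")                      // assumed input endpoint
            // only the first message per deviceId passes; later duplicates are dropped
            .idempotentConsumer(
                header("deviceId"),
                MemoryIdempotentRepository.memoryIdempotentRepository(10_000))
            .to("sql:insert into devices (id) values (:#${header.deviceId})"); // assumed SQL endpoint
    }
}

The in-memory repository is reset on restart, which matches the behavior of your current predicate; if the set of seen devices must survive restarts, Camel also ships persistent idempotent repositories, for example a JDBC-backed one in camel-sql.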

Related

Need advice on migrating from Flink DataStream Job to Flink Stateful Functions 3.1

I have a working Flink job built on the Flink DataStream API. I want to rewrite the entire job based on Flink Stateful Functions 3.1.
The functions of my current Flink Job are:
Read message from Kafka
Each message is a slice of a data packet, e.g. (s for slice):
s-0, s-1 are for packet 0
s-4, s-5, s-6 are for packet 1
The job merges slices into several data packets and then sinks the packets to HBase
Window functions are applied to deal with out-of-order slice arrival
My Objectives
I already have a Flink Stateful Functions demo running on my k8s cluster. I want to rewrite my entire job on top of Stateful Functions.
Save data into MinIO instead of HBase
My current plan
I have read the doc and got some ideas. My plans are:
There's no need to deal with Kafka anymore; the Kafka ingress (https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/apache-kafka/) handles it
Rewrite my job based on the Java SDK. Merging is straightforward, but how about window functions?
Maybe I should use persistent state with a TTL to mimic window function behavior
An egress for MinIO is not in the list of default Flink I/O connectors, so I need to write a custom Flink I/O connector for MinIO myself, according to https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/flink-connectors/
I want to avoid the embedded module because it prevents scaling. Auto-scaling is the key reason why I want to migrate to Flink Stateful Functions
My Questions
I don't feel confident with my plan. Is there anything wrong with my understanding/plan?
Are there any best practices I should refer to?
Update:
windows were used to assemble results
get a slice, inspect its metadata and know it is the last one of the packet
it also knows the packet should contain 10 slices
if there are already 10 slices, merge them
if there are not enough slices yet, wait for some time (e.g. 10 minutes) and then either merge or record a packet error.
I want to get rid of windows during the rewrite, but I don't know how
Background: Use KeyedProcessFunctions Rather than Windows to Assemble Related Events
With the DataStream API, windows are not a good building block for assembling together related events. The problem is that windows begin and end at times that are aligned to the clock, rather than being aligned to the events. So even if two related events are only a few milliseconds apart they might be assigned to different windows.
In general, it's more straightforward to implement this sort of use case with keyed process functions, and use timers as needed to deal with missing or late events.
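As a rough sketch of that pattern with the DataStream API (the Slice and Packet types, the packetId key, and the 10-slice / 10-minute figures are assumptions taken from the question, not a definitive implementation):

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Hypothetical slice/packet shapes, just enough to make the sketch self-contained. */
class Slice {
    String packetId;
    int expectedCount;
    byte[] payload;
}

class Packet {
    final List<Slice> slices;
    Packet(List<Slice> slices) { this.slices = slices; }
}

public class PacketAssembler extends KeyedProcessFunction<String, Slice, Packet> {

    private static final long TIMEOUT_MS = 10 * 60 * 1000; // 10 minutes, from the question

    private transient ListState<Slice> buffered;

    @Override
    public void open(Configuration parameters) {
        buffered = getRuntimeContext().getListState(new ListStateDescriptor<>("slices", Slice.class));
    }

    @Override
    public void processElement(Slice slice, Context ctx, Collector<Packet> out) throws Exception {
        buffered.add(slice);
        List<Slice> slices = new ArrayList<>();
        buffered.get().forEach(slices::add);

        if (slices.size() == 1) {
            // first slice of this packet: arm a timeout instead of relying on a window
            ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + TIMEOUT_MS);
        }
        if (slices.size() >= slice.expectedCount) {
            out.collect(new Packet(slices)); // merge logic elided
            buffered.clear();
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Packet> out) throws Exception {
        // the timeout fired before the packet completed: emit the partial packet or record an error
        List<Slice> slices = new ArrayList<>();
        buffered.get().forEach(slices::add);
        if (!slices.isEmpty()) {
            out.collect(new Packet(slices));
            buffered.clear();
        }
    }
}

The stream would be keyed by packet id before applying the function, e.g. slices.keyBy(s -> s.packetId).process(new PacketAssembler()).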
Doing this with the Statefun API
You can use the same pattern mentioned above. The function id will play the same role as the key, and you can use a delayed message instead of a timer:
as each slice arrives, add it to the packet that's being assembled
if it is the first slice, send a delayed message that will act as a timeout
when all the slices have arrived, merge them and send the packet
if the delayed message arrives before the packet is complete, do whatever is appropriate (e.g., go ahead and send the partial packet)
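A minimal sketch of that pattern with the StateFun 3.x Java SDK follows; the typename, the 10-slice / 10-minute figures and the string-encoded payloads are assumptions, and a real function would also keep the buffered slice payloads in state (with a custom type) and send the merged packet to an egress:

import java.time.Duration;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.statefun.sdk.java.Context;
import org.apache.flink.statefun.sdk.java.StatefulFunction;
import org.apache.flink.statefun.sdk.java.TypeName;
import org.apache.flink.statefun.sdk.java.ValueSpec;
import org.apache.flink.statefun.sdk.java.message.Message;
import org.apache.flink.statefun.sdk.java.message.MessageBuilder;

public class PacketAssemblerFn implements StatefulFunction {

    static final TypeName TYPENAME = TypeName.typeNameOf("example", "packet-assembler"); // assumed namespace

    // how many slices have arrived so far; a real function would also store the payloads
    static final ValueSpec<Integer> SLICE_COUNT = ValueSpec.named("slice_count").withIntType();

    private static final int EXPECTED_SLICES = 10;           // from the question
    private static final Duration TIMEOUT = Duration.ofMinutes(10);
    private static final String TIMEOUT_MARKER = "TIMEOUT";  // payload of the self-sent delayed message

    @Override
    public CompletableFuture<Void> apply(Context context, Message message) {
        int seen = context.storage().get(SLICE_COUNT).orElse(0);

        if (message.isUtf8String() && TIMEOUT_MARKER.equals(message.asUtf8String())) {
            // the delayed message fired before the packet completed
            if (seen > 0 && seen < EXPECTED_SLICES) {
                // emit a partial packet or record a packet error (elided), then reset
                context.storage().remove(SLICE_COUNT);
            }
            return context.done();
        }

        // a regular slice arrived (payload handling elided)
        if (seen == 0) {
            // first slice: schedule the timeout as a delayed message to ourselves
            context.sendAfter(TIMEOUT,
                MessageBuilder.forAddress(context.self()).withValue(TIMEOUT_MARKER).build());
        }
        seen++;

        if (seen >= EXPECTED_SLICES) {
            // all slices arrived: merge them and send the packet to an egress (elided)
            context.storage().remove(SLICE_COUNT);
        } else {
            context.storage().set(SLICE_COUNT, seen);
        }
        return context.done();
    }
}

The function id (the packet id used when addressing the function) plays the role of the key, so each packet gets its own slice_count state.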

Flink Request/Response pattern possible with combined source/sink?

I know that, by design and out of the box, request/reply data processing is not possible with Flink. But consider for example a legacy TCP application, which opens a connection to a server and expects a response on the same connection.
For example consider a legacy application, where the clients connect to a server via TCP and a custom protocol. They send some status information and expect a command as the response, where the command may depend on the current status.
Is it possible to build a combined source, which feeds the TCP message into the processing, and sink, which receives the processing result?
Building a source which accepts TCP connections and creates events from messages seems straightforward, but getting the corresponding response to the correct sink on the same worker(!) to send the response back to the client seems tricky.
I know that this can be implemented with an external component, but I'm wondering if it can be implemented directly in Flink with minimal overhead (e.g. for real-time performance reasons).
If this is possible, what would be the ways to do it and with which pros and cons?
Thank you!
Regards,
Kan
It depends on what your server-processing pipeline looks like.
If the processing can be modeled as a single chain, as in Source -> Map/flatMap/filter -> Map/flatMap/filter -> ... -> Sink, then you could pass the TCP connection itself to the next operation together with the data (presumably wrapped in a tuple or POJO). By virtue of being part of a chain, it is guaranteed that the entire computation happens within a single worker.
But the moment you do anything like grouping, windows, etc., this is no longer possible, since the processing may continue on another worker.
Normally if you're talking to an external service in Flink, you'd use an AsyncFunction. This lets you use incoming data to determine what request to make, and emit the results as the operator output. Is there any reason why this approach wouldn't work for you?
Note that you can play some games if you don't have any incoming data, e.g. have a source that regularly emits a "tickler" record, which then triggers the async request.
And if the result needs to feed back into the next request, you can use iterations, though they have limitations.
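For reference, a minimal sketch of the AsyncFunction approach; the String status/command types and the external call are placeholders, not a real protocol:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class CommandLookup extends RichAsyncFunction<String, String> {

    @Override
    public void asyncInvoke(String status, ResultFuture<String> resultFuture) {
        // fire the request without blocking the operator thread
        CompletableFuture
            .supplyAsync(() -> callExternalService(status))
            .thenAccept(command -> resultFuture.complete(Collections.singleton(command)));
    }

    private String callExternalService(String status) {
        // placeholder for the real request/response client
        return "COMMAND-for-" + status;
    }
}

// Wiring it into a pipeline:
// DataStream<String> commands =
//     AsyncDataStream.unorderedWait(statusStream, new CommandLookup(), 5, TimeUnit.SECONDS, 100);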

Why doesn't Flink dashboard show the number of records received from the source or written to a sink?

The Flink dashboard is great and shows a lot of details for jobs that are running. One thing I have noticed, however, is that the source and sinks of a job will show the records received and records sent as 0 respectively.
Now I know that they are still receiving and sending records to and from outside of the job, but that 0 tends to be very confusing to people. Is there a reason why this was chosen to be like this? Or a way to make it not be 0?
For sinks in particular, if the serialization schema fails to serialize a message (and the error is caught and logged instead of causing the job to fail), this isn't reflected in the number of records the sink has actually output. You just always see 0 and would assume everything made it through.
The reason is that we can't measure this in a generalized fashion and have to implement the measuring in each source/sink respectively for which we just haven't found the time yet. Another issue is that this would have to be done within user-defined functions, but the relevant metrics are not accessible from there (yet).
See https://issues.apache.org/jira/browse/FLINK-7286.

Could we maintain order of messages in AWS-IoT at subscriber end?

We have created a thing using the AWS IoT service and a topic for that thing. A subscriber has subscribed to the topic and a publisher is sending messages to it.
Below is the publisher messaging order:
message 0
message 1
message 2
message 3
message 4
At the subscriber end the sequence of messages is not maintained. It's showing like this:
message 0
message 1
message 4
message 2
message 3
True, in AWS IoT the message broker does not guarantee ordering when it delivers messages to devices.
The reason is that in a typical distributed-systems architecture, a single message from the publisher to the subscriber may take multiple paths, to ensure that the system is highly available and scalable. In the case of AWS IoT, the Device Gateway supports the publish/subscribe messaging pattern and enables scalable, low-latency, and low-overhead communication.
However, depending on the use case, there are several possible solutions. The publishers themselves can do the coordination: one simple, generic approach is to add a sequence number on the device side, which should be sufficient to handle the ordering of messages between publisher and subscriber. On the receiver, logic to process or discard messages based on that sequence number should be enough.
As written in the AWS documentation:
The message broker does not guarantee the order in which messages and ACK are received.
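To illustrate the sequence-number approach described above, here is a minimal subscriber-side sketch in Java; how the sequence number is extracted from the MQTT payload is left out, since that depends on your message format:

import java.util.TreeMap;
import java.util.function.Consumer;

/** Releases messages to the handler strictly in sequence-number order. */
public class SequenceReorderer {

    private final TreeMap<Long, String> pending = new TreeMap<>();
    private final Consumer<String> handler;
    private long nextExpected = 0;

    public SequenceReorderer(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Call this from the MQTT message callback with the sequence number parsed from the payload. */
    public synchronized void onMessage(long seq, String payload) {
        if (seq < nextExpected) {
            return; // duplicate or stale message, drop it
        }
        pending.put(seq, payload);
        // flush every message that is now in order
        while (!pending.isEmpty() && pending.firstKey() == nextExpected) {
            handler.accept(pending.pollFirstEntry().getValue());
            nextExpected++;
        }
    }
}

Depending on the use case you may also want a timeout for gaps, so that a lost message doesn't block delivery forever.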
I guess it's too late to answer this question, but I'll still go ahead so others facing this issue can have a workaround. I faced a similar scenario and did the following to make sure the order is maintained.
I added a sequence ID or timestamp to the payload sent to the broker from my IoT device (it can be any kind of client)
I then configured the IoT rules engine (add actions) to send the messages directly to DynamoDB, where the data was automatically stored in sorted order (it needs to be configured to sort by the sequence ID).
Then I used Lambda to pull the data out of DynamoDB for my further workflow, but you can use whatever service fits yours.

Not persisting messages when the system comes up in the wrong order

We're sending messages to Apache Camel using RabbitMQ.
We have a "sender" and a Camel route that processes a RabbitMQ message sent by the sender.
We're having deployment issues regarding which end of the system comes up first.
Our system is low-volume. I am sending perhaps 100 messages at a time. The point of the message is to reduce 'temporal cohesion' between a thing happening in our primary database and the logging of the same to a different database. We don't want our front-end to have to wait.
The "sender" will create an exchange if it does not exist.
The issue is causing deployment issues.
Here's what I see:
If I down the sender, down Camel, delete the exchange (clean slate), start the sender, then start Camel, and send 100 messages, the system works. (I think because the sender has to be run manually for testing, the Exchange is being created by the Camel Route...)
If I clean slate, and send a message, and then up Camel afterwards, I can see the messages land in RabbitMQ (using the web tool). No queues are bound. Once I start Camel, I can see its bound queue attached to the Exchange. But the messages have been lost to time and fate; they have apparently been dropped.
If, from the current state, I send more messages, they flow properly.
I think that if the messages that got dropped were persisted, I'd be ok. What am I missing?
For me it's hard to say what exactly is wrong, but I'll try and provide some pointers.
You should set up all exchanges and queues to be durable, and the messages persistent. You should never delete any of these entities (unless they are empty and you no longer use them); look at them as you would tables in a database. They are your infrastructure of sorts, and as with a database, you wouldn't want the first DB client that needs a table to be the one to create it (this of course applies to your use case, or at least that's how it seems to me).
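As an illustration, a minimal sketch with the plain Java client (the names app.events and app.events.log are placeholders; the Camel rabbitmq component has corresponding durable/autoDelete options on its endpoint URI):

import java.nio.charset.StandardCharsets;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class DurableSetup {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker location

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // durable exchange and queue: the definitions survive a broker restart
            channel.exchangeDeclare("app.events", "direct", true);
            channel.queueDeclare("app.events.log", true, false, false, null);
            channel.queueBind("app.events.log", "app.events", "log");

            // persistent message: written to disk once it sits in a durable queue
            channel.basicPublish("app.events", "log",
                MessageProperties.PERSISTENT_TEXT_PLAIN,
                "hello".getBytes(StandardCharsets.UTF_8));
        }
    }
}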
In the comments I mentioned flow state of the queue, but with 100 messages this will probably never happen.
Regarding message delivery: persistent or not, the broker (server) keeps messages until they are consumed and acknowledged by the consumer (in lots of APIs the acknowledgment is sent automatically, but it's actually one of the most important concepts).
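For example, with the plain Java client a consumer that acknowledges only after it has finished its work looks roughly like this (the queue name and the processing step are placeholders):

import java.nio.charset.StandardCharsets;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class ManualAckConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker location

        // the connection stays open so the consumer keeps receiving
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        boolean autoAck = false; // we acknowledge explicitly after processing
        DeliverCallback onMessage = (consumerTag, delivery) -> {
            String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
            System.out.println("processing " + body); // placeholder for the real work
            // only now tell the broker it may forget the message
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        channel.basicConsume("app.events.log", autoAck, onMessage, consumerTag -> { });
    }
}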
If the exchange to which the messages were published is deleted, they are gone. If the server gets killed or restarted and the messages are not persistent, again they're gone. There may well be some more scenarios in which messages get dropped (if I think of some I'll edit the answer).
If you don't have control over creating (usually 'declaring' in the APIs) exchanges and queues, then (aside from the fact that it's not the best thing IMHO) it can be tricky, since declaring those entities is idempotent, i.e. you can't create a durable queue q1 if a non-durable queue with the same name already exists. This could also be a problem in your case, since you mention the which-part-of-the-system-comes-up-first thing - maybe something is not declared with the same parameters on both sides...
