Can I rely on CamelSplitComplete when streaming? - apache-camel

We have a process that handles a large file. We are using a splitter with streaming() enabled.
The docs say
streaming: If enabled then Camel will split in a streaming fashion, which means it will split the input message in chunks. This reduces the memory overhead. For example, if you split big messages it's recommended to enable streaming. If streaming is enabled then the sub-message replies will be aggregated out-of-order, e.g. in the order they come back. If disabled, Camel will process sub-message replies in the same order as they were split.
So I know that exchanges can be aggregated out of order. Does the splitter mark the last exchange it handles with CamelSplitComplete set to true? If so, that exchange could be aggregated out of order and I'd end up considering my aggregation complete before I've aggregated all messages, which would lead to missing data.
If instead it sets CamelSplitComplete only on the exchange it knows will be aggregated last, then I believe I can rely on it.
UPDATE:
Assuming that it is safe to rely on CamelSplitComplete in the case above, is it safe to rely on it if my routes do filtering? I assume not, because the last row might match the filter criteria and be removed.

I have split large files with streaming and have used the CamelSplitComplete property to do some processing after the split is done. So yes, you can rely on it being set on the last exchange. Of course, it is best to have a Camel unit test to verify this, but it worked for me. I can't say for the filter case: what if you filter out the last exchange?
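Here's a rough sketch (Camel Java DSL) of how you could check the property inside the split; the endpoint URIs, the header name and the per-line processing are just placeholders:

    import org.apache.camel.Exchange;
    import org.apache.camel.builder.RouteBuilder;

    public class BigFileSplitRoute extends RouteBuilder {
        @Override
        public void configure() {
            from("file:inbox?fileName=bigfile.csv")
                .split(body().tokenize("\n")).streaming()
                    .process(exchange -> {
                        // ... per-line processing / aggregation of the sub-message ...
                        Boolean done = exchange.getProperty(Exchange.SPLIT_COMPLETE, Boolean.class);
                        if (Boolean.TRUE.equals(done)) {
                            // This exchange is the last one the splitter produces;
                            // CamelSplitSize should hold the total number of parts here.
                            Integer total = exchange.getProperty(Exchange.SPLIT_SIZE, Integer.class);
                            exchange.getIn().setHeader("totalParts", total);
                        }
                    })
                    .to("direct:collect")
                .end();
        }
    }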

Related

An Alternative Approach for Broadcast stream

I have two different streams in my Flink job:
The first one represents a set of rules which will be applied to the actual stream. I've simply broadcast this set of rules. Changes come from Kafka, and there are only a few changes each hour (around 100-200 per hour).
The second one is the actual stream, called the customer stream, which contains some numeric values for each customer. This is basically a keyed stream based on customerId.
So, basically I'm preparing my customer stream data, then applying some rules on the keyed stream, and getting the calculated results.
I also know which rules should be calculated by checking a field of the customer stream data. For example, if a field of the customer data contains the value X, the job has to apply only rule1, rule2 and rule5 instead of calculating all the rules (let's say there are 90 of them) for the given customer. Of course, in this case, I have to fetch and filter the rules by the field value of the incoming data.
Everything is OK in this scenario, and it fits the broadcast pattern perfectly. But the problem here is the huge broadcast size. Sometimes it can be very large, like 20 GB or more, which I suppose is far too big for broadcast state.
Is there any alternative approach to work around this limitation? For example, using a RocksDB backend (I know it's not supported for broadcast state, but I could implement a custom state backend for it if there is no fundamental limitation).
Does anything change if I connect both streams without broadcasting the rules stream?
From your description it sounds like you might be able to avoid broadcasting the rules (by turning this around and broadcasting the primary stream to the rules). Maybe this could work:
make sure each incoming customer event has a unique ID
key-partition the rules so that each rule has a distinct key
broadcast the primary stream events to the rules (and don't store the customer events)
union the outputs from applying all the rules
keyBy the unique ID from step (1) to bring together the results from applying each of the rules to a given customer event, and assemble a unified result
https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6 shows how to do fan-out/fan-in with Flink -- see that for an example of steps 1, 4, and 5 above.
If there's no way to partition the rules dataset, then I don't think you get a win by trying to connect streams.
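A rough sketch of steps 1-5 with the DataStream API, assuming hypothetical CustomerEvent, Rule, RuleResult and UnifiedResult types (CustomerEvent carries a unique eventId, Rule carries a ruleId plus appliesTo/evaluate methods, and AssembleResults is the fan-in step from the linked gist):

    // Broadcast the *customer* events to the keyed rules, instead of broadcasting the rules.
    MapStateDescriptor<Void, CustomerEvent> eventsDescriptor =
            new MapStateDescriptor<>("customer-events", Types.VOID,
                    TypeInformation.of(CustomerEvent.class));

    BroadcastStream<CustomerEvent> broadcastEvents = customerEvents.broadcast(eventsDescriptor);

    DataStream<RuleResult> partialResults = rules
            .keyBy(rule -> rule.ruleId)            // step 2: one distinct key per rule
            .connect(broadcastEvents)              // step 3: fan the customer events out
            .process(new KeyedBroadcastProcessFunction<String, Rule, CustomerEvent, RuleResult>() {

                private final ValueStateDescriptor<Rule> ruleDescriptor =
                        new ValueStateDescriptor<>("rule", Rule.class);

                @Override
                public void processElement(Rule rule, ReadOnlyContext ctx,
                                           Collector<RuleResult> out) throws Exception {
                    // Rule updates from Kafka: keep the latest version in keyed state.
                    getRuntimeContext().getState(ruleDescriptor).update(rule);
                }

                @Override
                public void processBroadcastElement(CustomerEvent event, Context ctx,
                                                    Collector<RuleResult> out) throws Exception {
                    // Every parallel instance sees every customer event; evaluate it
                    // against all rules whose keys live on this instance, and don't
                    // store the event itself.
                    ctx.applyToKeyedState(ruleDescriptor, (ruleId, state) -> {
                        Rule rule = state.value();
                        if (rule != null && rule.appliesTo(event)) {
                            out.collect(new RuleResult(event.eventId, ruleId, rule.evaluate(event)));
                        }
                    });
                }
            });

    // Steps 4/5: bring the per-rule results for one customer event back together.
    DataStream<UnifiedResult> results = partialResults
            .keyBy(r -> r.eventId)
            .process(new AssembleResults());   // hypothetical fan-in step

This way the rules are sharded across the parallel instances as keyed state (which can live in RocksDB), rather than being replicated to every instance as broadcast state.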
I would check out Apache Ignite as a way of sharing the rules across all of the subtasks processing the customer stream. See this article for a description of how this could be done.

Can I use KSQL to generate processing-time timeouts?

I am trying to use KSQL to do whatever processing I can within a time limit and get the results at that time limit. See Timely (and Stateful) Processing with Apache Beam under "Processing Time Timers" for the same idea illustrated using Apache Beam.
Given:
A stream of transactions with unique keys;
Updates to these transactions in the same stream; and
A downstream processor that wants to receive the updated transactions at a specific timeout - say 20 seconds - after the transactions appeared in the first stream.
Conceptually, I was thinking of creating a KTable of the first stream to hold the latest state of the transactions, and using KSQL to create an output stream by querying the KTable for keys with (create_time + timeout) < current_time. (and adding the timeouts as "updates" to the first stream so I could filter those out from the KTable)
I haven't found a way to do this in the KSQL docs, and even if there were a built-in current_time, I'm not sure it would be evaluated until another record came down the stream.
How can I do this in KSQL? Do I need a custom UDF? If it can't be done in KSQL, can I do it in KStreams?
=====
Update: It looks like KStreams does not support this today - Apache Flink appears to be the way to go for this use case (and many others). If you know of a clever way around KStreams' limitations, tell me!
Take a look at the punctuate() functionality in the Processor API of Kafka Streams, which might be what you are looking for. You can use punctuate() with stream-time (default: event-time) as well as with processing-time (via PunctuationType.WALL_CLOCK_TIME). Here, you would implement a Processor or a Transformer, depending on your needs, which will use punctuate() for the timeout functionality.
See https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html for more information.
Tip: You can use such a Processor/Transformer also in the DSL of Kafka Streams. This means you can keep using the more convenient DSL, if you like to, and only need to plug in the Processor/Transformer at the right place in your DSL-based code.
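A rough sketch of such a Transformer for the 20-second timeout (the Txn value type and the store names are placeholders; the two stores would be registered via StreamsBuilder#addStateStore and their names passed to KStream#transform):

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.kstream.Transformer;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.processor.PunctuationType;
    import org.apache.kafka.streams.state.KeyValueIterator;
    import org.apache.kafka.streams.state.KeyValueStore;

    // Emits the latest state of each transaction roughly 20 seconds of wall-clock
    // time after it was first seen; nothing is emitted before the timeout fires.
    public class TimeoutTransformer implements Transformer<String, Txn, KeyValue<String, Txn>> {

        private ProcessorContext context;
        private KeyValueStore<String, Txn> latest;       // latest state per transaction
        private KeyValueStore<String, Long> firstSeen;   // wall-clock time of first sighting

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            this.context = context;
            this.latest = (KeyValueStore<String, Txn>) context.getStateStore("txn-store");
            this.firstSeen = (KeyValueStore<String, Long>) context.getStateStore("txn-first-seen");

            // Fire once per second on processing time and flush everything that has
            // been waiting for at least 20 seconds.
            context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, now -> {
                List<String> expired = new ArrayList<>();
                try (KeyValueIterator<String, Long> it = firstSeen.all()) {
                    while (it.hasNext()) {
                        KeyValue<String, Long> entry = it.next();
                        if (now - entry.value >= 20_000L) {
                            context.forward(entry.key, latest.get(entry.key));
                            expired.add(entry.key);
                        }
                    }
                }
                for (String key : expired) {
                    latest.delete(key);
                    firstSeen.delete(key);
                }
            });
        }

        @Override
        public KeyValue<String, Txn> transform(String key, Txn value) {
            if (firstSeen.get(key) == null) {
                firstSeen.put(key, System.currentTimeMillis());
            }
            latest.put(key, value);   // keep only the latest update
            return null;              // emit nothing until the punctuator fires
        }

        @Override
        public void close() { }
    }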

Iteration over multiple streams in Apache Flink

My question is regarding iteration over multiple streams in Apache Flink.
I am a Flink beginner, and I am currently trying to execute a recursive query (e.g., datalog) on Flink.
For example, a query calculates the transitive closure every 5 minutes (tumbling window). I have one input stream, inputStream (consisting of the initial edge information), and another stream, outputStream (the transitive closure), which is initialised from the inputStream. I want to iteratively enrich the outputStream by joining it with the inputStream. For each iteration the feedback should be the outputStream, and the iteration should last until no more edges can be appended to the outputStream. The computation of my transitive closure should trigger periodically, every 5 minutes. During the iteration, the inputStream should be "held" and provide the data for my outputStream.
Is it possible to do this in Flink? Thanks for any help!
This sounds like a side-input issue, where you want to treat the "inputStream" as a batch dataset (with refresh) that's joined to the other "outputStream". Unfortunately Flink doesn't provide an easy way to implement that currently (see https://stackoverflow.com/a/48701829/231762)
If both of these streams are coming from data sources, then one approach is to create a wrapper source that controls the ordering of the records. It would have to emit something like a Tuple2 where one side or the other is null, and then in a downstream (custom) Function you'd essentially split these, and do the joining.
If that's possible, then this source can block the "output" tuples while it emits the "input" tuples, plus other logic it sounds like you need (5 minute refresh, etc). See my response to the other SO issue above for skeleton code that does this.
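To illustrate the downstream splitting described above, here is a rough sketch using side outputs (the UnionRecord and Edge types are made up, and the wrapper source itself is omitted):

    // Hypothetical union type emitted by the wrapper source:
    // exactly one of the two fields is non-null.
    public class UnionRecord {
        public Edge input;    // an element of the original inputStream, or null
        public Edge output;   // an element of the evolving outputStream, or null
    }

    // Split the unioned records again downstream, routing input edges to a side output.
    final OutputTag<Edge> inputTag = new OutputTag<Edge>("input-edges") {};

    SingleOutputStreamOperator<Edge> closureEdges = wrappedSource
        .process(new ProcessFunction<UnionRecord, Edge>() {
            @Override
            public void processElement(UnionRecord rec, Context ctx, Collector<Edge> out) {
                if (rec.input != null) {
                    ctx.output(inputTag, rec.input);   // "input" edges go to the side output
                } else {
                    out.collect(rec.output);           // main output carries the closure edges
                }
            }
        });

    DataStream<Edge> inputEdges = closureEdges.getSideOutput(inputTag);
    // inputEdges and closureEdges can then be joined/iterated on as described above.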

Local aggregation for data stream in Flink

I'm trying to find a good way to combine a Flink keyed WindowedStream locally in a Flink application. The idea is similar to a combiner in MapReduce: to combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is aggregate, but I wasn't able to find a good example of its usage on a WindowedStream.
It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?
There have been some initiatives to provide pre-aggregation in Flink, but you have to implement your own operator. In the case of the stream environment you have to extend the class AbstractStreamOperator.
KurtYoung implemented a BundleOperator. You can also use the Table API on top of the stream API; the Table API already provides local aggregation. I also have an example of a pre-aggregate operator that I implemented myself. The usual drawback of all those solutions is that you have to set the number of items to pre-aggregate, or a timeout after which to pre-aggregate. Without these you can run out of memory, or you never shuffle items (if the threshold number of items is never reached). In other words, they are rule-based. What I would like to have is something cost-based and more dynamic, something that adjusts those parameters at run-time.
I hope these links can help you. And, if you have ideas for the cost-based solution, please come to talk with me =).
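As a small illustration of the rule-based approach (count threshold only, no timeout, and the local buffer is not checkpointed), a local pre-aggregation before the shuffle could look roughly like this, assuming a stream of (key, count) tuples:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    // Buffers partial sums per key and flushes them downstream once
    // MAX_BUNDLE_SIZE distinct keys have been buffered.
    public class LocalPreAggregator
            implements FlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

        private static final int MAX_BUNDLE_SIZE = 1000;
        private final Map<String, Long> bundle = new HashMap<>();

        @Override
        public void flatMap(Tuple2<String, Long> value, Collector<Tuple2<String, Long>> out) {
            bundle.merge(value.f0, value.f1, Long::sum);
            if (bundle.size() >= MAX_BUNDLE_SIZE) {
                for (Map.Entry<String, Long> e : bundle.entrySet()) {
                    out.collect(Tuple2.of(e.getKey(), e.getValue()));
                }
                bundle.clear();
            }
        }
    }

    // Usage: pre-aggregate locally before the shuffle, then do the global windowed aggregation.
    input.flatMap(new LocalPreAggregator())
         .keyBy(t -> t.f0)
         .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
         .sum(1);

A proper operator would also flush on a timeout and snapshot its buffer during checkpoints, which is exactly where the rule-based limitations described above come from.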

Which is better: sending many small messages or fewer large ones?

I have an app whose messaging could be written at two granularities: sending many small messages vs. (possibly far) fewer larger ones. Conceptually, what moves around is a set of 'alive' vertex IDs that might get filtered at each superstep based on a processed list (the vertex value) that vertices manage. The ones that survive to the end are the lucky winners. compute() calculates a set of 'new-to-me' incoming IDs that are perfect for the outgoing message, but I could just as easily send each ID one at a time. My guess is that sending fewer messages matters more, but then each set might contain thousands of IDs. Thank you.
P.S. A side question: The few custom message type examples I've found are relatively simple objects with a few primitive instance variables, rather than collections. Is it nutty to send around a collection of IDs as a message?
I have used lists and even maps to be sent, or just stored as vertex data, so that isn't a problem. I don't think it matters much to Giraph which one you choose, and I'd rather go with many simple small messages, as that is the way Giraph is meant to be used. With collections you instead need to go through the list of messages in the compute function and, for each message, through its list of IDs.
Performance-wise it shouldn't make much difference. What I have found to make a big difference is to compute as much as possible in one cycle, as the switching between cycles and the synchronising of messages takes a lot of time. As long as that doesn't change, it should be more or less the same, and probably much easier to read and maintain if you keep the size of the messages small.
In order to answer your question, you need to understand the MessageStore interface and its implementations.
In a nutshell, under the hood it takes the following steps:
The worker receives the raw bytes of the messages and the destination IDs.
The worker sorts the messages and puts them into a map of maps. The first map's key is the partition ID; the second map's key is the vertex ID. (It is kind of like the post office: the worker is the central hub, and it sorts the letters by zip code first, then within each zip code by address.)
When it is the vertex's turn to compute, an Iterable of that vertex's messages is passed to the vertex's compute method, and that's where you get the messages and use them.
So fewer, bigger messages are better, because there is less sorting to do if the total number of bytes is the same in both cases.
Also, you could send many small messages but let Giraph combine them into one larger message (almost) automatically: you can use combiners.
The documentation on this subject on the Giraph site is terrible, but you may be able to extract an example from the book Practical Graph Analytics with Apache Giraph.
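For example, a combiner that merges ID-set messages headed for the same vertex might look roughly like this (LongIdSetWritable is a hypothetical Writable wrapping a set of vertex IDs that you would have to implement yourself):

    import org.apache.giraph.combiner.MessageCombiner;
    import org.apache.hadoop.io.LongWritable;

    // Merges all ID-set messages destined for the same vertex into a single
    // message, so many small messages become one larger one before delivery.
    public class IdSetCombiner
            implements MessageCombiner<LongWritable, LongIdSetWritable> {

        @Override
        public void combine(LongWritable vertexId,
                            LongIdSetWritable originalMessage,
                            LongIdSetWritable messageToCombine) {
            // Merge the incoming ID set into the message accumulated so far.
            originalMessage.addAll(messageToCombine.getIds());
        }

        @Override
        public LongIdSetWritable createInitialMessage() {
            return new LongIdSetWritable();   // empty ID set
        }
    }

It would then be registered on the job configuration (e.g. via setMessageCombinerClass). Combining only works if merging messages this way preserves the semantics your compute() expects, which is the case for a set of IDs.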
Mainly, this depends on the type of messages that you are sending.
