Doing action on last index of Camel aggregation split - apache-camel

The intention is to clear the class-level attributes of EDI850Processor once the last index of the aggregated split is reached. But there seems to be no property that identifies exactly when the last index of the split has been reached.
.split().jsonpath("$['ALL']").streaming().aggregate(AggregationStrategies.groupedExchange())
.constant("true")
.completionSize(20)
.completionTimeout(500)
.bean(EDI850Processor.class, "process(*)")
.marshal("edi850").id("marshal-edi850")
// Do some final action if it's the last index of the split
.choice()
.when(simple("${property.CamelSplitComplete} == 'true'"))
.bean(EDI850Processor.class, "clear(*)")
Properties available on exchange:
{CamelGroupedExchange=List<Exchange>(30 elements),
CamelAggregatedCompletedBy=size,
CamelMessageHistory=[DefaultMessageHistory[routeId=handle-EDI-processing, node=setHeader9], DefaultMessageHistory[routeId=handle-EDI-processing, node=bean5]],
CamelExternalRedelivered=false,
CamelAggregatedCorrelationKey=true,
CamelAggregatedSize=30,
CamelCreatedTimestamp=Tue Nov 05 20:45:03 IST 2019}
Which property can identify the last index?

You could put your EDI850Processor bean in the step immediately after the split. Then it would be called after all the split/aggregate processing is done.
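A rough sketch of that placement, reusing the route from the question (the "direct:edi850" endpoint is only assumed for illustration; the two end() calls close the aggregate and the split blocks, so clear(*) no longer runs per split part):
from("direct:edi850")
    .split().jsonpath("$['ALL']").streaming()
        .aggregate(AggregationStrategies.groupedExchange()).constant("true")
            .completionSize(20)
            .completionTimeout(500)
            .bean(EDI850Processor.class, "process(*)")
            .marshal("edi850").id("marshal-edi850")
        .end()  // closes the aggregate block
    .end()      // closes the split block
    // runs once, after the splitter has finished with all parts
    .bean(EDI850Processor.class, "clear(*)");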

Related

Unique Count for Multiple timewindows - Process or Reduce function combined with ProcessWindowFunction?

We need to find the number of unique elements in the input stream over multiple time windows.
The input data object has the following definition: InputData(ele1: Integer, ele2: String, ele3: String)
The stream is keyed by ele1 and ele2. The requirement is to find the number of unique ele3 values in the last 1 hour, 12 hours and 24 hours, with the result refreshing every 15 minutes.
We are using sliding time windows with a slide of 15 minutes and window sizes of 1, 12 and 24 hours.
Since we need to find unique elements, we initially used a process function as the window function, which stores all the elements (events) for each key until the end of the window in order to count the unique elements. We thought this could be optimized for memory consumption.
Instead, we tried a combination of a reduce function and a process window function to aggregate incrementally: the reduce function keeps adding the unique elements to a HashSet, and the process window function then counts the size of that HashSet.
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#processwindowfunction-with-incremental-aggregation
public class UserDataReducer implements ReduceFunction<UserData> {
    @Override
    public UserData reduce(UserData u1, UserData u2) {
        // merge the unique element3 values of both records into one set
        u1.getElement3().addAll(u2.getElement3());
        return new UserData.Builder(u1.getElement1(), u1.getElement2())
                .withUniqueUsers(u1.getElement3())
                .createUserData();
    }
}
public class UserDataProcessor extends ProcessWindowFunction<UserData, Metrics,
        Tuple2<Integer, String>, TimeWindow> {
    @Override
    public void process(Tuple2<Integer, String> key,
                        ProcessWindowFunction<UserData, Metrics, Tuple2<Integer, String>, TimeWindow>.Context context,
                        Iterable<UserData> elements,
                        Collector<Metrics> out) throws Exception {
        // the reducer leaves exactly one pre-aggregated element per key and window
        if (Objects.nonNull(elements.iterator().next())) {
            UserData aggregatedUserAttribution = elements.iterator().next();
            out.collect(new Metrics(
                    key.f0,  // ele1
                    key.f1,  // ele2
                    aggregatedUserAttribution.getElement3().size()));
        }
    }
}
We expected the heap memory consumption to drop, since we are now storing only one object per key per slide as state.
But there was no decrease in heap memory consumption; it was almost the same or a bit higher.
In the heap dump of the new job we observed a high number of HashMap instances, consuming more memory than the input data objects occupied in the earlier job.
What would be the best way to solve this: a process function alone, or incremental aggregation with a combination of reduce and process functions?
State Backend: Hashmap
Flink Version: 1.14.2 on Yarn
In this case I'm not really sure whether partial aggregation will reduce heap size. It should allow you to reduce the state size by some factor depending on the uniqueness of the dataset. That is because (as far as I understand) you are effectively copying the HashSet for every single element assigned to the window; while those copies are garbage collected, that doesn't happen immediately, so you will see quite a few of them in heap dumps.
Overall, the process-function-only approach will quite probably generate larger state, but in terms of heap size the two may be quite similar, as you have noticed.
One thing you might consider is more advanced processing. You could read up on triggers and implement a trigger in such a way that you have a single 24h window which emits results every 1h, 12h and 24h (after which the window is purged). Note that in that case you would need to do some work in the process function to make sure the results are correct. One more thing you can look at is this post.
Note that both proposed solutions will require some understanding of Flink and more manual processing of window elements.
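For reference, a rough sketch of what such a trigger could look like (the class name, the event-time assumption and the fixed offsets are illustrative, not something the answer prescribes); it would be attached to a single 24h window via .trigger(...), and the process function would still have to work out which interval each emission belongs to:
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Fires intermediate results at 1h and 12h after the window start and purges at the end (24h).
public class MultiIntervalTrigger extends Trigger<Object, TimeWindow> {

    private static final long HOUR = 60 * 60 * 1000L;
    private static final long[] FIRE_OFFSETS = {1 * HOUR, 12 * HOUR};

    @Override
    public TriggerResult onElement(Object element, long timestamp,
                                   TimeWindow window, TriggerContext ctx) throws Exception {
        // registering a timer for the same timestamp again has no additional effect
        for (long offset : FIRE_OFFSETS) {
            ctx.registerEventTimeTimer(window.getStart() + offset);
        }
        ctx.registerEventTimeTimer(window.maxTimestamp());
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        // final timer: emit and purge; intermediate timers: emit but keep the window contents
        return time == window.maxTimestamp() ? TriggerResult.FIRE_AND_PURGE : TriggerResult.FIRE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        for (long offset : FIRE_OFFSETS) {
            ctx.deleteEventTimeTimer(window.getStart() + offset);
        }
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}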

Gremlin: The provided traverser does not map to a value

g.V()
.has('atom', '_value', 'red').fold()
.coalesce(unfold(), addV('atom').property('_value', 'red')).as('atom')
.out('view').has('view', '_name', 'color').fold()
.coalesce(unfold(), addE('view').from('atom').to(addV('view').property('_name', 'color')))
Gives me an error:
The provided traverser does not map to a value: []->[SelectOneStep(last,atom)] (597)
What does it mean?
Adding to this in case someone else comes across this.
This specific error occurs when you use the id as a string in from() instead of the vertex object.
To see what I mean, as a simple test run the following gremlin query:
g.addE('view').from('atom').to(addV('view').property('_name', 'color'))
then run this query:
g.addE('view').from(V('atom')).to(addV('view').property('_name', 'color'))
The first query will give you the error stated above, the second one will not.
So it looks like when as() is followed by fold(), the label set in the as() step is lost. I used aggregate() instead, as follows:
g.V()
.has('atom', '_value', 'red')
.fold().coalesce(
unfold(),
addV('atom').property('_value', 'red')
)
.aggregate('atom')
.out('view').has('view', '_name', 'color')
.fold().coalesce(
unfold(),
addE('view')
.from(select('atom').unfold())
.to(addV('view').property('_name', 'color'))
.inV()
)
The fold() step is what is known as a reducing barrier step. With reducing barrier steps, any path history of the traversal (such as a label applied via as()) is lost. In a reducing barrier step many traversers are reduced down to a single traverser. After that step there is no way to know which of the many originally labeled vertices would be the correct one to retrieve.

Apache Camel: How to use "done" files to signal that writing records to a file is finished and it can be moved

As the title suggests, I want to move a file into a different folder after I am done writing DB records to it.
I have already looked into several questions related to this: Apache camel file with doneFileName
But my problem is a little different, since I am using split, streaming and parallelProcessing for fetching the DB records and writing them to a file. I cannot figure out when and how to create the done file when parallelProcessing is involved. Here is the code snippet:
My route to fetch records and write it to a file:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
.setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
.setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
.setBody(constant("<sql to fetch records>&outputType=StreamList"))
.to("jdbc:<endpoint>")
.split(body(), <aggregation>).streaming().parallelProcessing()
.<some processors>
.aggregate(header(Exchange.FILE_NAME), (o, n) -> {
<file aggregation>
return o;
}).completionInterval(<some time interval>)
.toD("file://<to the temp file>")
.end()
.end()
.to("file:"+<path to temp folder>+"?doneFileName=${file:header."+Exchange.FILE_NAME+"}.done"); //this line is just for trying out done filename
In my aggregation strategy for the splitter I have code that basically counts records processed and prepares the response that would be sent back to the caller.
And in the outer aggregate I have code for aggregating the DB rows and then writing them to the file.
And here is the file listener for moving the file:
from("file://<path to temp folder>?delete=true&include=<filename>.*.TXT&doneFileName=done")
.to("file://<final filename with path>?fileExist=Append");
Doing something like this is giving me this error:
Caused by: [org.apache.camel.component.file.GenericFileOperationFailedException - Cannot store file: <folder-path>/filename.TXT] org.apache.camel.component.file.GenericFileOperationFailedException: Cannot store file: <folder-path>/filename.TXT
at org.apache.camel.component.file.FileOperations.storeFile(FileOperations.java:292)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.writeFile(GenericFileProducer.java:277)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.processExchange(GenericFileProducer.java:165)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.process(GenericFileProducer.java:79)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.util.AsyncProcessorConverterHelper$ProcessorToAsyncProcessorBridge.process(AsyncProcessorConverterHelper.java:61)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.SendProcessor.process(SendProcessor.java:141)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:460)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.Pipeline.process(Pipeline.java:121)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.Pipeline.process(Pipeline.java:83)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.sendToConsumers(SedaConsumer.java:298)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.doRun(SedaConsumer.java:207)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.run(SedaConsumer.java:154)[209:org.apache.camel.camel-core:2.16.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[:1.8.0_144]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_144]
Caused by: org.apache.camel.InvalidPayloadException: No body available of type: java.io.InputStream but has value: Total number of records discovered: 5
What am I doing wrong? Any inputs will help.
PS: Newly introduced to Apache Camel
I would guess that the error comes from .toD("file://<to the temp file>") trying to write a file but finding the wrong type of body (the String "Total number of records discovered: 5" instead of an InputStream).
I don't understand why you have one file destination inside the splitter and one outside of it.
As @claus-ibsen suggested, try to remove the extra .aggregate(...) in your route. To split and re-aggregate, it is sufficient to reference the aggregation strategy in the splitter. Claus also pointed to an example in the Camel docs:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
.setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
.setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
.setBody(constant("<sql to fetch records>&outputType=StreamList"))
.to("jdbc:<endpoint>")
.split(body(), <aggregationStrategy>)
.streaming().parallelProcessing()
// the processors below get individual parts
.<some processors>
.end()
// The end statement above ends split-and-aggregate. From here
// you get the re-aggregated result of the splitter.
// So you can simply write it to a file and also write the done-file
.to(...);
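For example, the final .to(...) could be a file endpoint that also writes the done file your consumer route is waiting for (the target path is a placeholder):
    .to("file://<path to final folder>?doneFileName=done");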
However, if you need to control the aggregation sizes, you have to combine the splitter with an aggregator. That would look somewhat like this:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
.setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
.setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
.setBody(constant("<sql to fetch records>&outputType=StreamList"))
.to("jdbc:<endpoint>")
// No aggregationStrategy here so it is a standard splitter
.split(body())
.streaming().parallelProcessing()
// the processors below get individual parts
.<some processors>
.end()
// The end statement above ends split. From here
// you still got individual records from the splitter.
.to("seda:aggregate");
// new route to do the controlled aggregation
from("seda:aggregate")
// constant(true) is the correlation predicate => collect all messages in 1 aggregation
.aggregate(constant(true), new YourAggregationStrategy())
.completionSize(500)
// not sure if this 'end' is needed
.end()
// write files with 500 aggregated records here
.to("...");

Camel Splitter store CamelSplitSize and processed rows on failure

http://camel.apache.org/splitter.html [1]
From link [1] I saw that CamelSplitSize will be set on the completed Exchange.
I am learning Camel and I would like to know whether there is a way, when splitting an XML file containing 100 rows (assuming 100 rows), to get at the split properties on failure.
That is, if the split fails while processing the 50th row, we need to show CamelSplitIndex as 50, CamelSplitSize as 100 and CamelSplitComplete as false:
.bean(Splitter.class, "saveFile(${camelContext.properties[mySplitSize]}, ${camelContext.properties[mySplitIndex]}, ${camelContext.properties[mySplitComplete]})")
I could not find a way to accomplish this, as link [1] clearly states that CamelSplitSize will be stored only on the completed Exchange. Is there any way to achieve this?
If you need these properties, you can catch the exception that stops the splitter and get the exchange that caused it. There you will find your properties.
public void show(Exchange exchange) {
    // the exchange that failed inside the splitter is carried by the caught CamelExchangeException
    CamelExchangeException camelExceptionCaught =
            (CamelExchangeException) exchange.getProperty("CamelExceptionCaught");
    System.out.println(camelExceptionCaught.getExchange().getProperty("CamelSplitSize"));
    System.out.println(camelExceptionCaught.getExchange().getProperty("CamelSplitComplete"));
    System.out.println(camelExceptionCaught.getExchange().getProperty("CamelSplitIndex"));
}
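One way to wire this up is with an onException handler that passes the failed exchange to the show(...) method above (the endpoints, the XPath expression and the bean class name are illustrative only):
// In the RouteBuilder:
onException(Exception.class)
    .handled(true)
    .bean(SplitInfoBean.class, "show");    // SplitInfoBean holds the show(Exchange) method above

from("file:input?noop=true")
    .split().xpath("//row").stopOnException()
        .to("bean:rowProcessor")           // placeholder for the per-row processing
    .end();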

GoogleAppEngine - query with some custom filter

I am quite new to App Engine and Objectify. I need to fetch a single row from the DB to get some value from it. I tried to fetch the element with ofy().load().type(Branch.class).filter("parent_branch_id", 0).first(), but the result is FirstRef(null). However, the following loop does print the branches:
for(Branch b : ofy().load().type(Branch.class).list()) {
System.out.println(b.id +". "+b.tree_label+" - parent is " +b.parent_branch_id);
};
What am I doing wrong?
[edit]
Of course Branch is a database entity; if it matters, parent_branch_id is of type long.
If you want a Branch as the result of your request, I think you are missing a .now():
Branch branch = ofy().load().type(Branch.class).filter("parent_branch_id", 0).first().now();
It sounds like you don't have an @Index annotation on your parent_branch_id property. When you do ofy().load().type(Branch.class).list(), Objectify is effectively doing a batch get by kind (like doing Query("Branch") with the low-level API), so it doesn't need the property indexes. As soon as you add a filter(), it uses a query.
Assuming you are using Objectify 4, properties are not indexed by default. You can index all the properties in your entity by adding an @Index annotation to the class. The annotation reference provides useful info.
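For illustration, a minimal sketch of the entity with the property indexed (field names are taken from the question; the rest of the class layout is an assumption):
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;

@Entity
public class Branch {
    @Id Long id;
    @Index long parent_branch_id;  // indexed so filter("parent_branch_id", ...) can match
    String tree_label;
}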
Example from the Objectify API reference:
LoadResult<Thing> th = ofy.load().type(Thing.class).filter("foo", foo).first();
Thing th = ofy.load().type(Thing.class).filter("foo", foo).first().now();
So you need to make sure that member "foo" has an @Index annotation and use now() to fetch the first element. This will return null if no element is found.
Maybe "parent_branch_id" in your case is a long, in which case the filter value must be 0L and not 0.
