Camel: Splitting a collection and writing to files

I'm trying to split an ArrayList and write each element to its own file using Apache Camel, as in this simplified example:
from("timer://poll?period=10000").process(new Processor(){
public void process(Exchange exchange){
ArrayList<String> list = new ArrayList<String>();
list.add("one");
list.add("two");
list.add("three");
exchange.getIn().setBody(list, ArrayList.class);
}
}).split(body()).log(body().toString()).to("file:some/dir");
The log prints each item but only "three" is saved to a file. What am I doing wrong?
Jan

After you call split(), your route is divided into three branches, and every step after the split is applied to each sub-message.
On each sub-message the splitter also sets a CamelSplitIndex exchange property (note that on Camel 2.x it is an exchange property, not a header).
So this code should work:
from("timer://poll?period=10000").process(new Processor(){
public void process(Exchange exchange){
ArrayList<String> list = new ArrayList<String>();
list.add("one");
list.add("two");
list.add("three");
exchange.getIn().setBody(list, ArrayList.class);
}
}).split(body()).log(body().toString()).to("file:some/dir?fileName=${header.CamelSplitIndex}");
Here is a second example, with an XML file and XPath.
Suppose you want to explode the XML into one message per order node, naming each output after the name element inside it:
<orders>
    <order>
        <name>Order 1</name>
    </order>
    <order>
        <name>Order 2</name>
    </order>
</orders>
Suppose we want to explode this XML file into two files:
from("file://repo-source").split(xpath("//orders/order")).setHeader("orderName", xpath("/order/name").stringResult()).to("file://repo?fileName=${header.orderName}.xml");

The file producer will by default "override" if a file already exists.
See the fileExist option on its documentation page:
http://camel.apache.org/file2
If the input to the route is also a file, the producer will "inherit" the file name from the input.
So in your case, if you want to save each split message in a new file, you need to set a target file name using the fileName option:
"file:some/dir?fileName=splitted-${id}"
The fileName option supports the Simple and File languages:
http://camel.apache.org/simple.html
http://camel.apache.org/file-language.html
That means the file name can be computed dynamically, as above, where ${id} is a unique message ID.
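Put together, a minimal sketch of the route from the question with this fileName set (it only combines pieces already shown above):

from("timer://poll?period=10000")
    .setBody(constant(java.util.Arrays.asList("one", "two", "three")))
    .split(body())
    .log(body().toString())
    // ${id} resolves per split exchange, so each element
    // ends up in its own file
    .to("file:some/dir?fileName=splitted-${id}");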

Related

How to generate a dynamic path in a DataSet during output

Is there a way to create a dynamic DataSink output path in Flink?
The DataSet has the type Tuple2<String, String>.
When we used the streaming API, I had a way to generate a dynamic path using a custom Bucketer, like below:
@Override
public Path getBucketPath(Clock clock, Path basePath, Tuple2<String, String> element) {
    return new Path(basePath + "/schema=" + element.f0.toLowerCase().trim() + "/");
}
I would like to know if there is a similar way to generate a custom path with the DataSet API.
I poked around a bit and didn't find anything similar for batch processing. So I think you'd have to create your own OutputFormat class that wraps a regular FileOutputFormat and does the bucketing, reusing the same Bucketer logic.
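A minimal sketch of that idea, assuming a TextOutputFormat per bucket and the schema=... layout from the question (the class name and base path handling are illustrative assumptions, not a tested implementation):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.io.OutputFormat;
import org.apache.flink.api.java.io.TextOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class BucketingTextOutputFormat implements OutputFormat<Tuple2<String, String>> {

    private final String basePath;
    // one open TextOutputFormat per bucket, created lazily
    private transient Map<String, TextOutputFormat<String>> buckets;
    private int taskNumber;
    private int numTasks;

    public BucketingTextOutputFormat(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void configure(Configuration parameters) {
    }

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        this.taskNumber = taskNumber;
        this.numTasks = numTasks;
        this.buckets = new HashMap<String, TextOutputFormat<String>>();
    }

    @Override
    public void writeRecord(Tuple2<String, String> record) throws IOException {
        // same bucket layout as the streaming Bucketer in the question
        String bucket = record.f0.toLowerCase().trim();
        TextOutputFormat<String> format = buckets.get(bucket);
        if (format == null) {
            format = new TextOutputFormat<String>(new Path(basePath + "/schema=" + bucket));
            format.setWriteMode(FileSystem.WriteMode.OVERWRITE);
            format.configure(new Configuration());
            format.open(taskNumber, numTasks);
            buckets.put(bucket, format);
        }
        format.writeRecord(record.f1);
    }

    @Override
    public void close() throws IOException {
        for (TextOutputFormat<String> format : buckets.values()) {
            format.close();
        }
    }
}

You would then use it as dataSet.output(new BucketingTextOutputFormat("/base/path")).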

Inject a splitter that never aggregates

Camel 2.17.3: I want to insert a splitter into a route so that the split messages remain split. If I have a "direct" route with a splitter, then when control returns from the inner route, I no longer have split messages, only the original.
from("direct:in")
.transform(constant("A,B,C"))
.inOut("direct:inner")
.log("RET-VAL: ${in.body}");
from("direct:inner")
.split()
.tokenize(",")
.log("AFTER-SPLIT ${in.body}")
;
Based on the answer to a similar question, and Claus's comment below, I tried inserting my own aggregation strategy and always marking the group "COMPLETE". But only the last split message is returned to the outer route.
from("direct:in")
.transform(constant("A,B,C"))
.inOut("direct:inner")
.log("RET-VAL: ${in.body}");
from("direct:inner")
.split(body().tokenize(","), new MyAggregationStrategy())
.log("AFTER-SPLIT ${in.body}")
;
public static class MyAggregationStrategy implements AggregationStrategy
{
#Override
public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
System.out.println("Agg called with:"+newExchange.getIn().getBody());
newExchange.setProperty(Exchange.AGGREGATION_COMPLETE_CURRENT_GROUP, true);
return newExchange;
}
}
How do I get the messages to stay split, regardless of how routes are nested etc.?
See the Composed Message Processor EIP, with the splitter-only example:
http://camel.apache.org/composed-message-processor.html
In the AggregationStrategy you combine all the split sub-messages into one message, which is the result you want, i.e. the outgoing message of the splitter when it is done. How you do that depends on your messages and what you want to keep. For example, you can put the sub-messages together in a List, or if they are XML based you can append the XML fragments, or something similar.
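For instance, a minimal sketch of a list-building strategy, assuming String sub-messages as in the question (the class name is illustrative; on Camel 2.17 the interface lives in org.apache.camel.processor.aggregate):

import java.util.ArrayList;
import java.util.List;

import org.apache.camel.Exchange;
import org.apache.camel.processor.aggregate.AggregationStrategy;

public class ListAggregationStrategy implements AggregationStrategy {
    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        String body = newExchange.getIn().getBody(String.class);
        if (oldExchange == null) {
            // first sub-message: start the list
            List<String> list = new ArrayList<String>();
            list.add(body);
            newExchange.getIn().setBody(list);
            return newExchange;
        }
        // subsequent sub-messages: append to the existing list
        List<String> list = oldExchange.getIn().getBody(List.class);
        list.add(body);
        return oldExchange;
    }
}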

Apache Camel - java DSL - transform body to one of its fields

First, I'm fairly new to Camel so if what (or how) I'm trying to do here is dumb, let me know.
CODE:
from("direct:one")
.to("mock:two")
.process(new Processor(){
#Override
public void process(Exchange exchange)throws Exception{
MyCustomObject obj = exchange.getIn().getBody(MyCustomObject.class);
exchange.getOut().setBody(obj.getOneOfTheFields());
}
})
.to("mock:three");
QUESTION:
This processor transforms an object into one of its fields. I know that I could replace it with a simple expression, but that would require me to put 'oneOfTheFields' in a string, and I don't want to do that.
Is there a shorter way to do this using java code only?
This can be easily achieved using setBody and Camel's Simple language:
from("direct:one")
    .to("mock:two")
    .setBody(simple("${body.fieldName}"))
    .to("mock:three");
You specify the name of the field, and Camel uses the standard accessor mechanism to resolve it and set the body appropriately.
Can you not simply do this:
from("direct:one")
.to("mock:two")
.setBody(body().getOneOfTheFields())
.to("mock:three");
Let me know if this works.

Write and read file line by line with Camel

I would like to write byte arrays to a file with Camel. But in order to get my arrays back, I want to write them line by line, or with another separator.
How can I do that with Camel?
from(somewhere)
    .process(new Processor() {
        @Override
        public void process(final Exchange exchange) throws Exception {
            final MyObject body = exchange.getIn().getBody(MyObject.class);
            byte[] serializedObject = MySerializer.serialize(body);
            exchange.getOut().setBody(serializedObject);
            exchange.getOut().setHeader(Exchange.FILE_NAME, "filename");
        }
    })
    .to("file://filepath?fileExist=Append&autoCreate=true");
Or does anyone have another way to get them back?
PS: I need to have only one file, otherwise it would have been too easy...
EDIT:
I successfully wrote my objects line by line with the out.writeObject method (thanks to Petter), and I can read them back with:
InputStream file = new FileInputStream(FILENAME);
InputStream buffer = new BufferedInputStream(file);
input = new ObjectInputStream(buffer);
Object obj = null;
// note: readObject() signals end of stream with EOFException rather than
// returning null, so this loop only ends cleanly if a null was written last
while ((obj = input.readObject()) != null) {
    // Do something
}
But I am not able to split and read them back with Camel. Do you have any idea how to read them with Camel?
It depends on what your serialized objects look like, since you seem to have your own serializer. Is it standard Java binary serialization, like this?
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutput out = new ObjectOutputStream(bos);
out.writeObject(obj);
return bos.toByteArray();
It probably won't be such a great idea to use text-based separators like \n, since the serialized bytes can contain that value anywhere.
Can't you serialize into some text format instead? Camel has several easy-to-use data formats (http://camel.apache.org/data-format.html). XStream, for instance, takes a line of code or so to create XML from your objects, and then it's no big deal to split the file into several XML parts and read them back with XStream.
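A rough sketch of that idea, assuming camel-xstream is on the classpath; the endpoints and the myObject root tag are illustrative assumptions, not code from the question:

from("direct:objects")
    .marshal().xstream() // object -> XML text
    .to("file://repo?fileName=objects.xml&fileExist=Append");

from("file://repo?fileName=objects.xml&noop=true")
    // split on each serialized fragment's root tag and turn it back into an object
    .split(body().tokenizeXML("myObject", null))
    .unmarshal().xstream()
    .log("read back: ${body}");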
In your example, if you really want a separator, why don't you just append it to the byte[]? Copy the array into a new, bigger byte[] and insert some unique sequence at the end.
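A small sketch of that manual approach, inside the processor from the question (the delimiter bytes are an arbitrary example; they must be a sequence that can never occur in your serialized data):

final MyObject body = exchange.getIn().getBody(MyObject.class);
byte[] serialized = MySerializer.serialize(body);
// arbitrary example delimiter; must never occur inside the serialized bytes
byte[] delimiter = new byte[] { (byte) 0xBE, (byte) 0xEF, (byte) 0xCA, (byte) 0xFE };
byte[] record = new byte[serialized.length + delimiter.length];
System.arraycopy(serialized, 0, record, 0, serialized.length);
System.arraycopy(delimiter, 0, record, serialized.length, delimiter.length);
exchange.getOut().setBody(record);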

Training own model in opennlp

I am finding it difficult to create my own model in OpenNLP.
Can anyone tell me how to train my own model?
How should the training be done?
What should the input be, and where will the output model file get stored?
https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html
This page is very useful. It shows, both in code and with the OpenNLP command-line application, how to train models of all the different types, like entity extraction, part of speech, etc.
I could give you some code examples here, but the page is very clear to use.
Theory-wise:
Essentially you create a file which lists the stuff you want to train on, e.g. for the document categorizer:
Sport [whitespace] this is a page about football, rugby and stuff
Politics [whitespace] this is a page about tony blair being prime minister.
The format is described on the page above (each model expects a different format). Once you have created this file, you run it through either the API or the opennlp application (via the command line), and it generates a .bin file. Once you have this .bin file, you can load it into a model and start using it (as per the API on the above website). A training sketch follows below.
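A minimal sketch of the API route for this kind of training, using the OpenNLP 1.5.x document categorizer API (the file names are illustrative):

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class DoccatTraining {
    public static void main(String[] args) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        // one "category<whitespace>document text" sample per line
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream("train.txt"), charset);
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

        DoccatModel model;
        try {
            model = DocumentCategorizerME.train("en", sampleStream);
        } finally {
            sampleStream.close();
        }

        BufferedOutputStream modelOut = new BufferedOutputStream(
                new FileOutputStream("doccat.bin"));
        model.serialize(modelOut);
        modelOut.close();
    }
}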
First you need training data annotated with the required entity.
Sentences should be separated with a newline character (\n). Values should be separated from the <START> and <END> tags with a space character.
Let's say you want to create a medicine entity model, so the data should look something like this:
<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic that contains two medicines - <START:medicine> amoxicillin trihydrate <END> and
<START:medicine> potassium clavulanate <END>. They work together to kill certain types of bacteria and are used to treat certain types of bacterial infections.
You can refer to a sample dataset for an example. Training data should have at least 15,000 sentences to get better results.
Then you can use the OpenNLP TokenNameFinderTrainer command-line tool.
The output file will be in the .bin format.
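For example, an invocation following the 1.5.x manual (the model and data file names here match the code example further below):

opennlp TokenNameFinderTrainer -model mymodel.bin -lang en -data data.txt -encoding UTF-8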
Here is an example: Writing a custom NameFinder model in OpenNLP
For more details, refer to the OpenNLP documentation.
Perhaps this article will help you out. It describes how to do TokenNameFinder training from data extracted from Wikipedia...
Nuxeo blog: Mining Wikipedia with Hadoop and Pig for Natural Language Processing
Copy the data into data.txt and run the code below to get your own mymodel.bin.
You can refer to this file for the data: https://github.com/mccraigmccraig/opennlp/blob/master/src/test/resources/opennlp/tools/namefind/AnnotatedSentencesWithTypes.txt
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class Training {

    // where the trained model will be written
    static String onlpModelPath = "mymodel.bin";
    // training data set
    static String trainingDataFilePath = "data.txt";

    public static void main(String[] args) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream(trainingDataFilePath), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        TokenNameFinderModel model = null;
        try {
            // model = NameFinderME.train("en", "drugs", sampleStream,
            //         Collections.<String, Object>emptyMap(), 100, 4);
            model = NameFinderME.train("en", "drugs", sampleStream,
                    Collections.<String, Object>emptyMap());
        } finally {
            sampleStream.close();
        }

        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(onlpModelPath));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null) {
                modelOut.close();
            }
        }
    }
}
