I created a route to buffer/store marshalled objects (JSON) into files. This route (and the other route to read the buffer) works fine.
storing in buffer:
from(DIRECT_IN).marshal().json().marshal().gzip().to(fileTarget());
reading from buffer:
from(fileTarget()).unmarshal().gzip().unmarshal().json().to("mock:a")
To reduce I/O I want to aggregate many exchanges into one file. I tried to simply aggregate either after json() or before it (directly after from(...)), adding this:
.aggregate(constant(true)).completionSize(20).completionTimeout(1000).groupExchanges()
In both cases I get conversion exceptions. How do I do this correctly? I would prefer a way without a custom aggregator. It would be nice if many exchanges/objects were simply aggregated into one JSON document (as a list of objects) or into one text file, one JSON object per line.
Thanks in advance.
Meanwhile I added a simple aggregator:
public class LineAggregator implements AggregationStrategy {

    @Override
    public final Exchange aggregate(final Exchange oldExchange, final Exchange newExchange) {
        // if first message of aggregation
        if (oldExchange == null) {
            return newExchange;
        }
        // else aggregate
        String oldBody = oldExchange.getIn().getBody(String.class);
        String newBody = newExchange.getIn().getBody(String.class);
        String aggregate = oldBody + System.lineSeparator() + newBody;
        oldExchange.getIn().setBody(aggregate);
        return oldExchange;
    }
}
The routes look like this. To buffer:
from(...) // marshal objects to JSON
    .marshal()
    .json()
    .aggregate(constant(true), lineAggregator)
        .completionSize(BUFFER_PACK_SIZE)
        .completionTimeout(BUFFER_PACK_TIMEOUT)
    .marshal()
    .gzip()
    .to(...)
From the buffer:
from(...)
    .unmarshal()
    .gzip()
    .split()
        .tokenize("\r\n|\n|\r")
    .unmarshal()
    .json()
    .to(....)
But the question remains, is the aggregator necessary?
This is the structure of my Firestore database:
Expected result: to get all the jobs where, in the experience array, the lang value is "Swift".
So as per this I should get the first 2 documents. The 3rd document does not have the experience "Swift".
Query jobs = db.collection("Jobs").whereArrayContains("experience.lang", "Swift");
jobs.get().addOnSuccessListener(new OnSuccessListener<QuerySnapshot>() {
    @Override
    public void onSuccess(QuerySnapshot queryDocumentSnapshots) {
        // Always the queryDocumentSnapshots size is 0
    }
});
I tried most of the answers but none worked out. Is there any way to query data in this structure? The docs cover only a plain array, not an array of custom objects.
Actually, it is possible to perform such a query with a database structure like yours. I have replicated your schema, and here are document1, document2, and document3.
Note that you cannot query using partial (incomplete) data. You are using only the lang property to query, which is not correct. You should use an object that contains both properties, lang and years.
Looking at your screenshot, at first glance the experience array is a list of HashMap objects. But here comes the nicest part: that list can simply be mapped into a list of custom objects. Let's try to map each object from the array to an object of type Experience. The model contains only two properties:
public class Experience {
    public String lang, years;

    public Experience() {}

    public Experience(String lang, String years) {
        this.lang = lang;
        this.years = years;
    }
}
I don't know how you named the class that represents a document, but I named it simply Job. To keep it simple, I have only used two properties:
public class Job {
    public String name;
    public List<Experience> experience;
    // Other properties

    public Job() {}
}
Now, to perform a search for all documents that contain in the array an object with the lang set to Swift, please follow the next steps. First, create a new object of the Experience class:
Experience firstExperience = new Experience("Swift", "1");
Now you can query like so:
CollectionReference jobsRef = rootRef.collection("Jobs");
jobsRef.whereArrayContains("experience", firstExperience).get().addOnCompleteListener(new OnCompleteListener<QuerySnapshot>() {
    @Override
    public void onComplete(@NonNull Task<QuerySnapshot> task) {
        if (task.isSuccessful()) {
            for (QueryDocumentSnapshot document : task.getResult()) {
                Job job = document.toObject(Job.class);
                Log.d(TAG, job.name);
            }
        } else {
            Log.d(TAG, task.getException().getMessage());
        }
    }
});
The result in the logcat will be the name of document1 and document2:
firstJob
secondJob
And this is because only those two documents contain in the array an object where the lang is set to Swift.
You can also achieve the same result when using a Map:
Map<String, Object> firstExperience = new HashMap<>();
firstExperience.put("lang", "Swift");
firstExperience.put("years", "1");
So there is no need to duplicate data in this use case. I have also written an article on the same topic:
How to map an array of objects from Cloud Firestore to a List of objects?
Edit:
In your approach it provides the result only if experience is "1" and lang is "Swift", right?
That's correct, it only searches for one element. However, if you need to query for more than that:
Experience firstExperience = new Experience("Swift", "1");
Experience secondExperience = new Experience("Swift", "4");
//Up to ten
We use another approach, which is actually very simple. I'm talking about Query's whereArrayContainsAny() method:
Creates and returns a new Query with the additional filter that documents must contain the specified field, the value must be an array, and that the array must contain at least one value from the provided list.
And in code it should look like this:
jobsRef.whereArrayContainsAny("experience", Arrays.asList(firstExperience, secondExperience)).get().addOnCompleteListener(new OnCompleteListener<QuerySnapshot>() {
    @Override
    public void onComplete(@NonNull Task<QuerySnapshot> task) {
        if (task.isSuccessful()) {
            for (QueryDocumentSnapshot document : task.getResult()) {
                Job job = document.toObject(Job.class);
                Log.d(TAG, job.name);
            }
        } else {
            Log.d(TAG, task.getException().getMessage());
        }
    }
});
The result in the logcat will be:
firstJob
secondJob
thirdJob
And this is because all three documents contain one or the other object.
The reason I am talking about duplicating data in a document is that documents have limits, i.e. there are limits to how much data you can put into a single document. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB of data in total in a single document, so storing duplicated data only increases the chance of reaching that limit.
If I send null data for "experience" and "Swift" as "lang", will it be queried?
No, it will not work.
Edit2:
The whereArrayContainsAny() method works with a maximum of 10 objects. If you have 30, you should save each query.get() of 10 objects into a Task object and then pass them one by one to Tasks' whenAllSuccess(Task... tasks) method.
You can also pass them directly as a list to Tasks' whenAllSuccess(Collection<? extends Task<?>> tasks) method.
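A rough sketch of that idea, assuming a list called allExperiences that holds more than 10 Experience objects (the variable name is made up for illustration):
List<Task<QuerySnapshot>> tasks = new ArrayList<>();
for (int i = 0; i < allExperiences.size(); i += 10) {
    // Each chunk of at most 10 objects becomes its own whereArrayContainsAny() query
    List<Experience> chunk = allExperiences.subList(i, Math.min(i + 10, allExperiences.size()));
    tasks.add(jobsRef.whereArrayContainsAny("experience", chunk).get());
}
Tasks.whenAllSuccess(tasks).addOnSuccessListener(results -> {
    for (Object result : results) {
        for (QueryDocumentSnapshot document : (QuerySnapshot) result) {
            Log.d(TAG, document.toObject(Job.class).name);
        }
    }
});
Note that a document matching objects from two different chunks will show up once per matching query, so you may want to de-duplicate the results by document ID.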
With your current document structure, it's not possible to perform the query you want. Firestore does not allow queries for individual fields of objects in list fields.
What you would have to do is create an additional field in your document that is queryable. For example, you could create a list field with only the list of string languages that are part of the document. With this, you could use an array-contains query to find the documents where a language is mentioned at least once.
For the document shown in your screenshot, you would have a list field called "languages" with values ["Swift", "Kotlin"].
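A hedged example of what that query could look like once such a "languages" field exists (db as in the question):
db.collection("Jobs")
        .whereArrayContains("languages", "Swift")
        .get()
        .addOnSuccessListener(queryDocumentSnapshots -> {
            // Each matching document mentions "Swift" at least once in its languages array
            for (QueryDocumentSnapshot document : queryDocumentSnapshots) {
                Log.d(TAG, document.getId());
            }
        });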
I have continuous JSONArray data produced to a Kafka topic, and I want to process the records with the EventTime characteristic. In order to reach this goal, I have to assign a watermark to each record contained in the JSONArray.
I didn't find a convenient way to achieve this. My solution is to consume the data as DataStreamSource<List<MockData>>, then iterate over the List and collect each object to a downstream with an anonymous ProcessFunction, and finally assign watermarks to this downstream.
The major code shows below:
DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);
SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
        .process(new ProcessFunction<List<MockData>, MockData>() {
            @Override
            public void processElement(List<MockData> value, Context ctx, Collector<MockData> out)
                    throws Exception {
                value.forEach(mockData -> out.collect(mockData));
            }
        });
convertToPojo.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(MockData element) {
                return element.getTimestamp();
            }
        });
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = convertToPojo
        .keyBy("country")
        .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
        .process(new FlinkEventTimeCountFunction()).name("count elements");
The code seems all right and runs without errors, but the ProcessWindowFunction is never triggered. I tracked the Flink source code and found that EventTimeTrigger never returns TriggerResult.FIRE, because TriggerContext.getCurrentWatermark returns Long.MIN_VALUE all the time.
What's the proper way to process a List like this in event time? Any suggestion will be appreciated.
The problem is that you are applying the keyBy and window operations to the convertToPojo stream, rather than the stream with timestamps and watermarks (which you didn't assign to a variable).
If you write the code more or less like this, it should work:
listDataStreamSource = KafkaSource ...
convertToPojo = listDataStreamSource.process ...
pojoPlusWatermarks = convertToPojo.assignTimestampsAndWatermarks ...
countStream = pojoPlusWatermarks.keyBy ...
Calling assignTimestampsAndWatermarks on the convertToPojo stream does not modify that stream, but rather creates a new datastream object that includes timestamps and watermarks. You need to apply your windowing to that new datastream.
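Concretely, with the classes from the question (MockData, KafkaSource, FlinkEventTimeCountFunction), the fix is just to keep a reference to the stream returned by assignTimestampsAndWatermarks and apply the windowing to that; a sketch:
DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);

SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
        .process(new ProcessFunction<List<MockData>, MockData>() {
            @Override
            public void processElement(List<MockData> value, Context ctx, Collector<MockData> out) {
                value.forEach(out::collect);
            }
        });

// assignTimestampsAndWatermarks returns a NEW stream carrying the timestamps and watermarks
SingleOutputStreamOperator<MockData> pojoPlusWatermarks = convertToPojo
        .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(MockData element) {
                        return element.getTimestamp();
                    }
                });

// Window on the watermarked stream, not on convertToPojo
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = pojoPlusWatermarks
        .keyBy("country")
        .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
        .process(new FlinkEventTimeCountFunction()).name("count elements");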
Camel ver 2.17.3: I want to insert a splitter into a route so that split messages remain split. If I have a "direct" route with a splitter, when control returns from the inner route, I no longer have split messages, only the original.
from("direct:in")
.transform(constant("A,B,C"))
.inOut("direct:inner")
.log("RET-VAL: ${in.body}");
from("direct:inner")
.split()
.tokenize(",")
.log("AFTER-SPLIT ${in.body}")
;
Based on the answer to a similar question, and Claus's comment below, I tried inserting my own aggregator and always marking the group "COMPLETE". Only the last (split) message is being returned to the outer route.
from("direct:in")
.transform(constant("A,B,C"))
.inOut("direct:inner")
.log("RET-VAL: ${in.body}");
from("direct:inner")
.split(body().tokenize(","), new MyAggregationStrategy())
.log("AFTER-SPLIT ${in.body}")
;
public static class MyAggregationStrategy implements AggregationStrategy
{
#Override
public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
System.out.println("Agg called with:"+newExchange.getIn().getBody());
newExchange.setProperty(Exchange.AGGREGATION_COMPLETE_CURRENT_GROUP, true);
return newExchange;
}
}
How do I get the messages to stay split, regardless of how routes are nested etc.?
See the Composed Message Processor EIP, http://camel.apache.org/composed-message-processor.html, with the splitter-only example.
And in the AggregationStrategy you combine all those split sub-messages into one message, which is the result you want, e.g. the outgoing message of the splitter when it is done. How you do that depends on your messages and what you want to keep. For example, you can put the sub-messages together in a List, or maybe it's XML based and you can append the XML fragments, or something similar.
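As an illustration only (not necessarily the combination you want), an AggregationStrategy that gathers the split sub-message bodies into a single List could look roughly like this:
public class GroupedBodyAggregationStrategy implements AggregationStrategy {

    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        Object newBody = newExchange.getIn().getBody();
        if (oldExchange == null) {
            // First sub-message: start a new list
            List<Object> bodies = new ArrayList<>();
            bodies.add(newBody);
            newExchange.getIn().setBody(bodies);
            return newExchange;
        }
        // Subsequent sub-messages: append to the existing list
        List<Object> bodies = oldExchange.getIn().getBody(List.class);
        bodies.add(newBody);
        return oldExchange;
    }
}
The inner route would then use .split(body().tokenize(","), new GroupedBodyAggregationStrategy()), and the outer route receives one message whose body is the List of sub-results.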
I have a file with over 3 million pipe-delimited rows that I want to insert into a database. It's a simple table (no normalisation required).
Setting up the route to watch for the file, read it in using streaming mode and split the lines is easy. Inserting rows into the table will also be a simple wiring job.
Question is: how can I do this using batched inserts? Let's say that 1000 rows is optimal. Given that the file is streamed, how would the SQL component know that the stream had finished? Let's say the file had 3,000,001 records. How can I set Camel up to insert the last stray record?
Inserting the lines one at a time can be done, but this will be horribly slow.
I would recommend something like this:
from("file:....")
.split("\n").streaming()
.to("any work for individual level")
.aggregate(body(), new MyAggregationStrategy().completionSize(1000).completionTimeout(50)
.to(sql:......);
I didn't validate all the syntax, but the plan would be to grab the file, split it with streaming, then aggregate groups of 1000 and use a timeout to catch that last smaller group. Those aggregated groups could simply make the body a list of strings, or whatever format you need for your batch SQL insert.
Here is a more accurate example:
@Component
@Slf4j
public class SQLRoute extends RouteBuilder {

    @Autowired
    ListAggregationStrategy aggregationStrategy;

    @Override
    public void configure() throws Exception {
        from("timer://runOnce?repeatCount=1&delay=0")
            .to("sql:classpath:sql/orders.sql?outputType=StreamList")
            .split(body()).streaming()
                .aggregate(constant(1), aggregationStrategy).completionSize(1000).completionTimeout(500)
                    .to("log:batch")
                    .to("google-bigquery:google_project:import:orders")
                .end()
            .end();
    }

    @Component
    class ListAggregationStrategy implements AggregationStrategy {

        public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
            List rows = null;
            if (oldExchange == null) {
                // First row
                rows = new LinkedList();
                rows.add(newExchange.getMessage().getBody());
                newExchange.getMessage().setBody(rows);
                return newExchange;
            }
            rows = oldExchange.getIn().getBody(List.class);
            Map newRow = newExchange.getIn().getBody(Map.class);
            log.debug("Current rows count: {}", rows.size());
            log.debug("Adding new row: {}", newRow);
            rows.add(newRow);
            oldExchange.getIn().setBody(rows);
            return oldExchange;
        }
    }
}
This can be done using the camel-spring-batch component: http://camel.apache.org/springbatch.html. The volume of the commit per step can be defined by the commitInterval, and the orchestration of the job is defined in a Spring config. It works quite well for use cases similar to your requirement.
Here's a nice example from github : https://github.com/hekonsek/fuse-pocs/tree/master/fuse-pocs-springdm-springbatch/fuse-pocs-springdm-springbatch-bundle/src/main
Is it possible to aggregate multiple small XML documents:
<doc><field name="XXX">fieldValue</field></doc>
using aggregator2 (camel 2.7.0) into one big document
<result><doc>...</doc><doc>...</doc><doc>...</doc>...<doc>...</doc></result>
without using a custom aggregator processor? I've managed to get it done by creating a custom aggregator, but now I'm simplifying my code, so I would like to get rid of it if Camel supports this out of the box.
My custom aggregator looks like this:
class DocsAggregator implements Processor {

    void process(Exchange exchange) {
        def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        def Document parentDoc = builder.parse(new ByteArrayInputStream("<?xml version='1.0'?><add></add>".toString().bytes))

        def groupedExchanges = exchange.properties.find { it.key == 'CamelGroupedExchange' }
        groupedExchanges.value.each { Exchange x ->
            def Document document = x.'in'.body
            def bos = new ByteArrayOutputStream()
            TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(bos))

            def node = document.documentElement.childNodes.find { Node it -> it.nodeType == Node.ELEMENT_NODE }
            def cloned = parentDoc.adoptNode(node)
            parentDoc.documentElement.appendChild(cloned)
        }
        exchange.in.body = parentDoc
    }
}
Okay, so you are using the grouped exchange option. Then it's a bit different: the data is stored as a property on the exchange, as a List.
Instead of the processor you can use a POJO and bind a parameter to the property. But the List still contains Exchange objects so you need to invoke the getIn().getBody() methods on it. But if you do it like this you don't need to import any Camel API in the POJO.
public Document mergeMyStuff(@Property("CamelGroupedExchange") List grouped) {
    Document parent = ...
    for (int i = 0; i < grouped.size(); i++) {
        Document doc = ((Exchange) grouped.get(i)).getIn().getBody(Document.class);
        // ... add to parent doc
    }
    return parent;
}
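A rough sketch of how such a POJO could be wired into the route (the endpoint URIs, the completion settings and the class name DocsMerger are placeholders, not something from the question):
from("direct:docs")
    .aggregate(constant(true)).completionSize(10).completionTimeout(1000)
        .groupExchanges()
    .bean(new DocsMerger(), "mergeMyStuff")
    .to("mock:result");
Here DocsMerger is assumed to be the class holding the mergeMyStuff method shown above; Camel's bean parameter binding fills the List from the CamelGroupedExchange property.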
By custom aggregator processor, you mean a custom AggregationStrategy? If that's the case, then no. Currently that's required.
We have it on the roadmap to offer a POJO model for the aggregation so you don't need to use that Camel API. So expect this to be simpler in the future.