How do I set up a streamed set of SQL inserts in Apache Camel?

I have a file with over 3 million pipe-delimited rows that I want to insert into a database. It's a simple table (no normalisation required).
Setting up the route to watch for the file, read it in using streaming mode, and split the lines is easy. Inserting rows into the table will also be a simple wiring job.
The question is: how can I do this using batched inserts? Let's say that 1,000 rows is optimal. Given that the file is streamed, how would the SQL component know that the stream has finished? Say the file had 3,000,001 records: how can I set Camel up to insert the last stray record?
Inserting the lines one at a time can be done, but this will be horribly slow.

I would recommend something like this:
from("file:....")
.split("\n").streaming()
.to("any work for individual level")
.aggregate(body(), new MyAggregationStrategy().completionSize(1000).completionTimeout(50)
.to(sql:......);
I didn't validate all the syntax, but the plan would be to grab the file split it with streams, then aggregate groups of 1000 and have a timeout to catch that last smaller group. Those aggregated groups could simply make the body a list of strings or whatever format you will need for your batch sql insert.
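To make the final insert a true batch, one option (a sketch only, not validated; the file URI, table, column names and the line parsing are placeholders) is to have that aggregation strategy collect a List of parameter maps and send it to the SQL component with batch=true, so each aggregated group becomes one JDBC batch and the completionTimeout flushes the final partial group, such as record 3,000,001:

from("file:/path/to/inbox")
    .split(body().tokenize("\n")).streaming()
        // parse one pipe-delimited line into the named parameters the SQL component expects
        .process(exchange -> {
            String[] fields = exchange.getIn().getBody(String.class).split("\\|");
            Map<String, Object> row = new HashMap<>();
            row.put("col1", fields[0]);
            row.put("col2", fields[1]);
            exchange.getIn().setBody(row);
        })
        // MyAggregationStrategy collects the maps into a List (see the next answer for one way to write it)
        .aggregate(constant(true), new MyAggregationStrategy())
            .completionSize(1000).completionTimeout(1000)
            // batch=true makes camel-sql treat the List body as one batch of parameter sets
            .to("sql:insert into target_table (col1, col2) values (:#col1, :#col2)?batch=true")
        .end()
    .end();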

Here is a more accurate example:
@Component
@Slf4j
public class SQLRoute extends RouteBuilder {

    @Autowired
    ListAggregationStrategy aggregationStrategy;

    @Override
    public void configure() throws Exception {
        from("timer://runOnce?repeatCount=1&delay=0")
            .to("sql:classpath:sql/orders.sql?outputType=StreamList")
            .split(body()).streaming()
                .aggregate(constant(1), aggregationStrategy).completionSize(1000).completionTimeout(500)
                    .to("log:batch")
                    .to("google-bigquery:google_project:import:orders")
                .end()
            .end();
    }

    @Component
    class ListAggregationStrategy implements AggregationStrategy {

        public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
            List rows = null;
            if (oldExchange == null) {
                // first row
                rows = new LinkedList();
                rows.add(newExchange.getMessage().getBody());
                newExchange.getMessage().setBody(rows);
                return newExchange;
            }
            rows = oldExchange.getIn().getBody(List.class);
            Map newRow = newExchange.getIn().getBody(Map.class);
            log.debug("Current rows count: {}", rows.size());
            log.debug("Adding new row: {}", newRow);
            rows.add(newRow);
            oldExchange.getIn().setBody(rows);
            return oldExchange;
        }
    }
}

This can be done using the camel-spring-batch component (http://camel.apache.org/springbatch.html). The commit volume per step can be defined by the commitInterval, and the orchestration of the job is defined in a Spring config. It works quite well for use cases similar to your requirement.
Here's a nice example from GitHub: https://github.com/hekonsek/fuse-pocs/tree/master/fuse-pocs-springdm-springbatch/fuse-pocs-springdm-springbatch-bundle/src/main

Related

Proper way to assign watermarks with DataStreamSource<List<T>> using Flink

I have continuous JSONArray data produced to a Kafka topic, and I want to process the records with the EventTime characteristic. To reach this goal, I have to assign a watermark to each record contained in the JSONArray.
I didn't find a convenient way to achieve this. My solution is to consume the data as DataStreamSource<List<MockData>>, then iterate the List and collect each object to a downstream with an anonymous ProcessFunction, and finally assign watermarks to this downstream.
The main code is shown below:
DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);

SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
        .process(new ProcessFunction<List<MockData>, MockData>() {
            @Override
            public void processElement(List<MockData> value, Context ctx, Collector<MockData> out)
                    throws Exception {
                value.forEach(mockData -> out.collect(mockData));
            }
        });

convertToPojo.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(MockData element) {
                return element.getTimestamp();
            }
        });

SingleOutputStreamOperator<Tuple2<String, Long>> countStream = convertToPojo
        .keyBy("country")
        .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
        .process(new FlinkEventTimeCountFunction()).name("count elements");
The code seems all right and runs without error, but the ProcessWindowFunction is never triggered. I traced the Flink source code and found that EventTimeTrigger never returns TriggerResult.FIRE, because TriggerContext.getCurrentWatermark returns Long.MIN_VALUE all the time.
What's the proper way to process a List in event time? Any suggestion will be appreciated.
The problem is that you are applying the keyBy and window operations to the convertToPojo stream, rather than the stream with timestamps and watermarks (which you didn't assign to a variable).
If you write the code more or less like this, it should work:
listDataStreamSource = KafkaSource ...
convertToPojo = listDataStreamSource.process ...
pojoPlusWatermarks = convertToPojo.assignTimestampsAndWatermarks ...
countStream = pojoPlusWatermarks.keyBy ...
Calling assignTimestampsAndWatermarks on the convertToPojo stream does not modify that stream, but rather creates a new datastream object that includes timestamps and watermarks. You need to apply your windowing to that new datastream.
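Spelled out a little more (a sketch only, reusing the classes from the question), the fix is simply to keep the watermarked stream in a variable and window over it:

// assign timestamps/watermarks and keep the resulting stream
SingleOutputStreamOperator<MockData> pojoPlusWatermarks = convertToPojo
        .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(MockData element) {
                        return element.getTimestamp();
                    }
                });

// key and window the stream that actually carries watermarks
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = pojoPlusWatermarks
        .keyBy("country")
        .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
        .process(new FlinkEventTimeCountFunction())
        .name("count elements");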

Is there an equivalent to Kafka's KTable in Apache Flink?

Apache Kafka has a concept of a KTable, where
each data record represents an update
Essentially, I can consume a Kafka topic and only keep the latest message per key.
Is there a similar concept available in Apache Flink? I have read about Flink's Table API, but it does not seem to solve the same problem.
Some help comparing and contrasting the two frameworks would be helpful. I am not looking for which is better or worse, but rather just how they differ. The answer to which is right would then depend on my requirements.
You are right. Flink's Table API and its Table class do not correspond to Kafka's KTable. The Table API is a relational language-embedded API (think of SQL integrated in Java and Scala).
Flink's DataStream API does not have a built-in concept that corresponds to a KTable. Instead, Flink offers sophisticated state management and a KTable would be a regular operator with keyed state.
For example, a stateful operator with two inputs that stores the latest value observed from the first input and joins it with values from the second input can be implemented with a CoFlatMapFunction as follows:
DataStream<Tuple2<Long, String>> first = ...
DataStream<Tuple2<Long, String>> second = ...

DataStream<Tuple2<String, String>> result = first
    // connect first and second stream
    .connect(second)
    // key both streams on the first (Long) attribute
    .keyBy(0, 0)
    // join them
    .flatMap(new TableLookup());

// ------

public static class TableLookup
        extends RichCoFlatMapFunction<Tuple2<Long, String>, Tuple2<Long, String>, Tuple2<String, String>> {

    // keyed state
    private ValueState<String> lastVal;

    @Override
    public void open(Configuration conf) {
        ValueStateDescriptor<String> valueDesc =
                new ValueStateDescriptor<String>("table", Types.STRING);
        lastVal = getRuntimeContext().getState(valueDesc);
    }

    @Override
    public void flatMap1(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
        // update the value for the current Long key with the String value
        lastVal.update(value.f1);
    }

    @Override
    public void flatMap2(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
        // look up the latest String for the current Long key
        String lookup = lastVal.value();
        // emit the current String and the looked-up String
        out.collect(Tuple2.of(value.f1, lookup));
    }
}
In general, state can be used very flexibly with Flink and lets you implement a wide range of use cases. There are also more state types, such as ListState and MapState, and with a ProcessFunction you have fine-grained control over time, for example to remove the state of a key if it has not been updated for a certain amount of time (KTables have a configuration for that, as far as I know).
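For illustration, here is a rough sketch of such time-based cleanup with a ProcessFunction on a keyed stream (the class and field names and the retention period are made up for the example), using a processing-time timer to drop entries that have not been updated for a while:

public static class ExpiringTable
        extends ProcessFunction<Tuple2<Long, String>, Tuple2<Long, String>> {

    private static final long RETENTION_MS = 60 * 60 * 1000L; // illustrative: one hour

    private ValueState<String> lastVal;   // the "table" entry for the current key
    private ValueState<Long> expiryTime;  // when the current entry may be removed

    @Override
    public void open(Configuration conf) {
        lastVal = getRuntimeContext().getState(
                new ValueStateDescriptor<>("table", Types.STRING));
        expiryTime = getRuntimeContext().getState(
                new ValueStateDescriptor<>("expiry", Types.LONG));
    }

    @Override
    public void processElement(Tuple2<Long, String> value, Context ctx,
                               Collector<Tuple2<Long, String>> out) throws Exception {
        lastVal.update(value.f1);
        out.collect(value);
        // push the expiry forward and register a timer for it
        long newExpiry = ctx.timerService().currentProcessingTime() + RETENTION_MS;
        expiryTime.update(newExpiry);
        ctx.timerService().registerProcessingTimeTimer(newExpiry);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<Long, String>> out) throws Exception {
        // only clean up if no newer update has pushed the expiry further out
        Long expiry = expiryTime.value();
        if (expiry != null && timestamp >= expiry) {
            lastVal.clear();
            expiryTime.clear();
        }
    }
}

This only works on a keyed stream (e.g. after a keyBy on the Long attribute), since both the state and the timers are scoped to the current key.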

Inject a splitter that never aggregates

Camel ver 2.17.3: I want to insert a splitter into a route so that split messages remain split. If I have a "direct" route with a splitter, when control returns from the inner route, I no longer have split messages, only the original.
from("direct:in")
.transform(constant("A,B,C"))
.inOut("direct:inner")
.log("RET-VAL: ${in.body}");
from("direct:inner")
.split()
.tokenize(",")
.log("AFTER-SPLIT ${in.body}")
;
Based on the answer to a similar question, and Claus's comment below, I tried inserting my own aggregator and always marking the group "COMPLETE". Only the last (split) message is being returned to the outer route.
from("direct:in")
.transform(constant("A,B,C"))
.inOut("direct:inner")
.log("RET-VAL: ${in.body}");
from("direct:inner")
.split(body().tokenize(","), new MyAggregationStrategy())
.log("AFTER-SPLIT ${in.body}")
;
public static class MyAggregationStrategy implements AggregationStrategy
{
#Override
public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
System.out.println("Agg called with:"+newExchange.getIn().getBody());
newExchange.setProperty(Exchange.AGGREGATION_COMPLETE_CURRENT_GROUP, true);
return newExchange;
}
}
How do I get the messages to stay split, regardless of how routes are nested etc.?
See this EIP:
http://camel.apache.org/composed-message-processor.html
in particular the splitter-only example.
In the AggregationStrategy you combine all of the split sub-messages into one message, which is the result you want, i.e. the outgoing message of the splitter when it is done. How you do that depends on your messages and what you want to keep. For example, you can put the sub-messages together in a List (as sketched below), or if they are XML based you can append the XML fragments, or something similar.
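A minimal sketch of such a strategy (not validated against 2.17.3) that collects every split sub-message body into a List, so the outer route receives all of them:

public static class ListOfBodiesStrategy implements AggregationStrategy {

    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        if (oldExchange == null) {
            // first sub-message: start the list
            List<Object> bodies = new ArrayList<>();
            bodies.add(newExchange.getIn().getBody());
            newExchange.getIn().setBody(bodies);
            return newExchange;
        }
        // subsequent sub-messages: append to the list built so far
        List<Object> bodies = oldExchange.getIn().getBody(List.class);
        bodies.add(newExchange.getIn().getBody());
        return oldExchange;
    }
}

Plugged in as .split(body().tokenize(","), new ListOfBodiesStrategy()), the outer route's RET-VAL log then shows the list of split parts instead of only the last one.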

Spring data : CrudRepository's save method and update

I wanted to know if the save method in CrudRepository does an update if it already finds the entry in the database, like:
@Repository
public interface ProjectDAO extends CrudRepository<Project, Integer> {}

@Service
public class ProjectServiceImpl {

    @Autowired
    private ProjectDAO pDAO;

    public void save(Project p) {
        pDAO.save(p);
    }
}
So if I call that method on an already registered entry, will it update it if it finds a changed attribute?
Thanks.
I wanted to know if the save method in CrudRepository does an update
if it already finds the entry in the database
The Spring documentation about it is not precise:
Saves a given entity. Use the returned instance for further operations
as the save operation might have changed the entity instance
completely.
But as the CrudRepository interface doesn't offer another method with an explicit name for updating an entity, we may suppose that the answer is yes, since CRUD is expected to cover all CRUD operations (create, read, update, delete).
This supposition is confirmed by the implementation of the SimpleJpaRepository class, the default implementation of CrudRepository, which shows that both cases are handled by the method:
@Transactional
public <S extends T> S save(S entity) {
    if (entityInformation.isNew(entity)) {
        em.persist(entity);
        return entity;
    } else {
        return em.merge(entity);
    }
}
So if I call that method on an already registered entry, will it update
it if it finds a changed attribute?
It will do a merge operation in this case, so all fields are updated according to how the merging cascade and read-only options are set.
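As a hedged illustration of the two code paths (reusing ProjectDAO/pDAO from the question; findById assumes a Spring Data version where CrudRepository exposes it, and setName is an assumed setter on Project):

// id is null -> isNew() is true -> em.persist -> INSERT
Project fresh = new Project();
fresh.setName("brand new");
pDAO.save(fresh);

// id is set -> isNew() is false -> em.merge -> UPDATE with the changed attributes
Project existing = pDAO.findById(42)
        .orElseThrow(IllegalStateException::new);
existing.setName("renamed");
pDAO.save(existing);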
Looking at the default implementation of the CrudRepository interface:
/*
 * (non-Javadoc)
 * @see org.springframework.data.repository.CrudRepository#save(java.lang.Object)
 */
@Transactional
public <S extends T> S save(S entity) {
    if (entityInformation.isNew(entity)) {
        em.persist(entity);
        return entity;
    } else {
        return em.merge(entity);
    }
}
The save method handles two situations:
- If the entity's id is null (a new entity), save will call persist, so an insert query is executed.
- If the id is not null, save will call merge: it fetches the existing entity (from the second-level cache, or from the database if it is not cached), compares the detached entity with the managed one, and finally propagates the changes to the database with an update query.
To be precise, the save(obj) method will treat obj as a new record if the id is empty (and therefore do an insert), and will treat obj as an existing record if the id is filled in (and therefore do a merge).
Why is this important?
Let's say the Project object contains an auto-generated id and also a person_id which must be unique. You make a Project object, fill in the person_id but not the id, and then try to save. Hibernate will try to insert this record, since the id is empty, but if that person already exists in the database, you will get a duplicate key exception.
How to handle it?
Either do a findByPersonId(id) to check whether the object is already in the database, and if it is found, take the id from it and set it on your object before saving (as sketched below),
or just try the save and catch the exception, in which case you know it is in the database already and you need to get and set the id before saving.
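A hedged sketch of the first option (it assumes a findByPersonId finder has been added to the repository, here called projectDao, plus the usual getters and setters on the entity):

// look the record up by its unique person_id first
Project existing = projectDao.findByPersonId(incoming.getPersonId());
if (existing != null) {
    // reuse the existing primary key so save() takes the merge/update path
    incoming.setId(existing.getId());
}
projectDao.save(incoming); // insert if nothing was found, update otherwise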
I wanted to know if the save method in CrudRepository does an update if it already finds the entry in the database:
The answer is yes, it will update the entry if it finds one.
From the Spring documentation (https://docs.spring.io/spring-data/jpa/docs/1.5.0.RELEASE/reference/html/jpa.repositories.html):
Saving an entity can be performed via the CrudRepository.save(…)-Method. It will persist or merge the given entity using the underlying JPA EntityManager. If the entity has not been persisted yet Spring Data JPA will save the entity via a call to the entityManager.persist(…)-Method, otherwise the entityManager.merge(…)-Method will be called.
In my case I had to add the id property to the entity and annotate it with @Id, like this:
@Id
private String id;
This way, when you fetch the object it carries the id of the entity in the database, and save performs an update instead of a create.

How to aggregate many marshalled (json) objects to one file

I created a route to buffer/store marshalled objects (json) into files. This route (and the other route that reads the buffer) works fine.
storing in buffer:
from(DIRECT_IN).marshal().json().marshal().gzip().to(fileTarget());
reading from buffer:
from(fileTarget()).unmarshal().gzip().unmarshal().json().to("mock:a")
To reduce I/O I want to aggregate many exchanges into one file. I tried to aggregate right after json() and also before it, so I added this after json() or after from(...):
.aggregate(constant(true)).completionSize(20).completionTimeout(1000).groupExchanges()
In both cases I get conversion exceptions. How do I do it correctly? I would prefer a way without a custom aggregator. And it would be nice if many exchanges/objects were simply aggregated into one JSON (as a list of objects) or into one text file, one JSON object per line.
Thanks in advance.
Meanwhile I added a simple aggregator:
public class LineAggregator implements AggregationStrategy {

    @Override
    public final Exchange aggregate(final Exchange oldExchange, final Exchange newExchange) {
        // if first message of aggregation
        if (oldExchange == null) {
            return newExchange;
        }
        // else aggregate
        String oldBody = oldExchange.getIn().getBody(String.class);
        String newBody = newExchange.getIn().getBody(String.class);
        String aggregate = oldBody + System.lineSeparator() + newBody;
        oldExchange.getIn().setBody(aggregate);
        return oldExchange;
    }
}
The routes now look like this. Into the buffer:
from(...) // marshal objects to json
    .marshal()
    .json()
    .aggregate(constant(true), lineAggregator)
        .completionSize(BUFFER_PACK_SIZE)
        .completionTimeout(BUFFER_PACK_TIMEOUT)
    .marshal()
    .gzip()
    .to(...)
from buffer:
from(...)
    .unmarshal()
    .gzip()
    .split()
        .tokenize("\r\n|\n|\r")
    .unmarshal()
    .json()
    .to(....)
But the question remains: is the custom aggregator necessary?
