I want to use a non-serializable object in stream.map(), like this:
stream.map { i =>
  val obj = new SomeUnserializableClass()
  obj.doSomething(i)
}
This is very inefficient, because it creates a new SomeUnserializableClass instance for every element. It only needs to be created once per worker.
In Spark, I can use mapPartitions to do this, but I don't know how to do it in the Flink streaming API.
If you are dealing with a non-serializable class, I recommend creating a RichFunction, in your case a RichMapFunction.
A rich function in Flink has an open method that is called on the TaskManager once per parallel instance, as an initializer, before any records are processed.
So the trick is to make your field transient and instantiate it in your open method.
Check the example below:
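import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class NonSerializableFieldMapFunction extends RichMapFunction<Object, Object> {
    // transient: skipped when the function object is serialized and shipped to the workers
    private transient SomeUnserializableClass someUnserializableClass;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        // created on the worker, once per parallel instance
        this.someUnserializableClass = new SomeUnserializableClass();
    }

    @Override
    public Object map(Object o) throws Exception {
        return someUnserializableClass.doSomething(o);
    }
}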
Then your code will look like:
stream.map(new NonSerializableFieldMapFunction())
P.S.: I'm using Java syntax; please adapt it to Scala.
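To see why the transient-plus-open() trick works, here is a minimal sketch in plain Java with no Flink dependency. It assumes the function object reaches the workers through Java serialization. Helper, Mapper and ship() are hypothetical names made up for this illustration; ship() only imitates the trip to a worker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientOpenDemo {
    /** Hypothetical helper that does NOT implement Serializable. */
    static class Helper {
        String doSomething(String in) {
            return in.toUpperCase();
        }
    }

    /** Has the shape of a rich function: serialized first, then open() is called. */
    static class Mapper implements Serializable {
        private static final long serialVersionUID = 1L;

        // transient: Java serialization skips this field entirely
        private transient Helper helper;

        void open() {
            helper = new Helper();
        }

        boolean isOpen() {
            return helper != null;
        }

        String map(String value) {
            return helper.doSomething(value);
        }
    }

    /** Serializes and deserializes the mapper, imitating shipping it to a worker. */
    static Mapper ship(Mapper mapper) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(mapper);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Mapper) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Mapper shipped = ship(new Mapper());
        System.out.println(shipped.isOpen()); // false: the transient field arrived as null
        shipped.open();                       // the "worker" builds the helper locally
        System.out.println(shipped.map("flink")); // FLINK
    }
}
```

The helper never crosses the wire, so it does not matter that it is not serializable; it only has to be constructible on the receiving side.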
While going through the official Flink documentation, I came across CheckpointedFunction.
I am wondering why and when you would use this function. I am currently working on a stateful Flink job that relies heavily on ProcessFunction to save state in RocksDB, and I would like to know whether CheckpointedFunction is better than ProcessFunction.
CheckpointedFunction is for cases where you need to work with state that should be managed by Flink and included in checkpoints, but where you aren't working with a KeyedStream and so you cannot use keyed state like you would in a KeyedProcessFunction.
The most common use cases of CheckpointedFunction are in sources and sinks.
In addition to @David's answer, I have another use case in which I don't use CheckpointedFunction with a source or sink. I use it in a ProcessFunction where I want to count (programmatically) how many times my job has restarted. MyProcessFunction extends ProcessFunction and implements CheckpointedFunction, and I update the ListState<Long> restarts when the job restarts. I use this state in the integration tests to ensure that the job was restarted after a failure. I based my example on the Flink checkpoint example for sinks.
public class MyProcessFunction<V> extends ProcessFunction<V, V> implements CheckpointedFunction {
    ...
    private transient ListState<Long> restarts;

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception { ... }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        restarts = context.getOperatorStateStore().getListState(new ListStateDescriptor<Long>("restarts", Long.class));
        if (context.isRestored()) {
            List<Long> restoreList = Lists.newArrayList(restarts.get());
            if (restoreList == null || restoreList.isEmpty()) {
                restarts.add(1L);
                System.out.println("restarts: 1");
            } else {
                Long max = Collections.max(restoreList);
                System.out.println("restarts: " + max);
                restarts.add(max + 1);
            }
        } else {
            System.out.println("restarts: never restored");
        }
    }

    @Override
    public void open(Configuration parameters) throws Exception { ... }

    @Override
    public void processElement(V value, Context ctx, Collector<V> out) throws Exception { ... }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<V> out) throws Exception { ... }
}
I am using Apache Flink to perform analytics on streaming data.
I am using a dependency whose object takes more than 10 seconds to create, because it reads several files from HDFS during initialization.
If I initialize the object in the open method, I get a timeout exception; if I initialize it in the constructor of a sink/flatMap, I get a serialization exception.
Currently I initialize the object in a static block in another class and call Preconditions.checkNotNull(MGenerator.mGenerator) in the main class; it then works when used in a flatMap or sink.
Is there a way to create an object of a non-serializable dependency, which might take more than 10 seconds to initialize, in Flink's flatMap or sink?
public class DependencyWrap {
    static MGenerator mGenerator;

    static {
        final String configStr = "{}";
        final Config config = new Gson().fromJson(configStr, Config.class);
        mGenerator = new MGenerator(config);
    }
}

public class MyStreaming {
    public static void main(String[] args) throws Exception {
        Preconditions.checkNotNull(MGenerator.mGenerator);
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        ...
        input.flatMap(new RichFlatMapFunction<Map<String,Object>,List<String>>() {
            @Override
            public void open(Configuration parameters) {
            }

            @Override
            public void flatMap(Map<String,Object> value, Collector<List<String>> out) throws Exception {
                out.collect(MFVGenerator.mfvGenerator.generateMyResult(value.f0, value.f1));
            }
        });
    }
}
Also, please correct me if I am wrong about the question.
Doing it in the open method is 100% the right way to do it. Is the timeout exception coming from Flink, or from the object itself?
As a last-ditch method, you could wrap your object in a class that contains both the object and its JSON string or Config (is Config serializable?), with the object marked transient, and then override the readObject/writeObject methods to call the constructor. If the mGenerator object itself is stateless (and you'll have other problems if it's not), the serialization code should get called only once, when jobs are distributed to the TaskManagers.
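The wrapper idea could look roughly like the following plain-Java sketch. Generator and GeneratorWrapper are hypothetical stand-ins (the real MGenerator and Config are not shown in the question), and the sketch assumes the configuration can be carried as a plain String. Only readObject is customized here; the default write behavior is enough because the transient field is skipped anyway.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializableWrapperDemo {
    /** Hypothetical stand-in for the non-serializable dependency. */
    static class Generator {
        final String config;

        Generator(String config) {
            this.config = config;
        }
    }

    /** Serializable wrapper: ships only the config and rebuilds the object on arrival. */
    static class GeneratorWrapper implements Serializable {
        private static final long serialVersionUID = 1L;

        private final String configJson;       // serialized
        private transient Generator generator; // not serialized

        GeneratorWrapper(String configJson) {
            this.configJson = configJson;
            this.generator = new Generator(configJson);
        }

        Generator get() {
            return generator;
        }

        // Called by Java serialization on the receiving side.
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();                     // restores configJson
            this.generator = new Generator(configJson); // re-runs the constructor
        }
    }

    /** Serializes and deserializes the wrapper, imitating distribution to a worker. */
    static GeneratorWrapper roundTrip(GeneratorWrapper wrapper) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(wrapper);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (GeneratorWrapper) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After a round trip the wrapper holds a freshly constructed Generator rather than a copy of the original one, which is why this only makes sense for a stateless object.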
Using open is usually the right place to load external lookup sources. The timeout is a bit odd; maybe there is a configuration option for it.
However, if the object is huge, using a static loader (either a static class as you did, or a singleton) has the benefit that you only need to load it once for all parallel instances of the task on the same task manager. Hence, you save memory and CPU time. This is especially true for you, as you use the same data structure in two separate tasks. Further, the static loader can be lazily initialized when it's used for the first time to avoid the timeout in open.
The clear downside of this approach is that the testability of your code suffers. There are some ways around that, which I could expand on if there is interest.
I don't see a benefit of using the proxy serializer pattern. It's unnecessarily complex (custom serialization in Java) and offers little benefit.
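A lazily initialized static loader could be sketched with the initialization-on-demand holder idiom, shown here in plain Java. Generator is a hypothetical stand-in for the slow dependency, and the CONSTRUCTIONS counter exists only so the demo can show that the object is built once. Note that "once" here means once per JVM, or more precisely once per class loader that loads the holder class.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LazyGeneratorHolder {
    /** Counts constructor calls; present only for demonstration purposes. */
    static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

    /** Hypothetical stand-in for the slow-to-build, non-serializable dependency. */
    static class Generator {
        Generator() {
            CONSTRUCTIONS.incrementAndGet();
        }

        String generate(String input) {
            return "result:" + input;
        }
    }

    /** The JVM initializes this class on first access, exactly once and thread-safely. */
    private static class Holder {
        static final Generator INSTANCE = new Generator();
    }

    /** Nothing is built until the first call, so the cost is paid lazily on first use. */
    static Generator get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            // each thread plays the role of one parallel task instance
            workers[i] = new Thread(() -> get().generate("x"));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println("constructions: " + CONSTRUCTIONS.get()); // constructions: 1
    }
}
```

A flatMap or sink would then call LazyGeneratorHolder.get() from inside its processing method instead of holding the object in a field, so nothing non-serializable is attached to the function itself.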
I implemented a custom ScalarFunction and would like to initialize a HashMap with data read from a database in the open() function, but it always gets stuck there. Is this the right way to use open()? And how many times is open() invoked: just once, or as many times as eval()?
My sample code is as follows:
public class GenNameUDF extends ScalarFunction {
    @Override
    public void open(FunctionContext context) throws Exception {
        super.open(context);
        CommonClass.map = initMap(); // here will read data from db
    }

    public String eval(String pubIp) {
        return todo();
    }
}
If the amount of data in the database is small, you can use a broadcast so that the database is accessed only once.
Otherwise, you can open the connection in the open method and query the database when needed.
-yt $file is also helpful.
I define certain variables in one Java class and access them from a different class in order to filter the stream for unique elements. Please refer to the code below to understand the issue better.
The problem I am facing is that this filter function doesn't work well and fails to filter unique events. I suspect the variable is shared among different threads and that this is the cause. Please suggest another method if this is not the correct way to do it. Thanks in advance.
**ClassWithVariables.java**
public static HashMap<String, ArrayList<String>> uniqueMap = new HashMap<>();
**FilterClass.java**
public boolean filter(String val) throws Exception {
    if (ClassWithVariables.uniqueMap.containsKey(key)) {
        ArrayList<String> al = ClassWithVariables.uniqueMap.get(key);
        if (al.contains(val)) {
            return false;
        } else {
            // Update the hashmap list (uniqueMap)
            return true;
        }
    } else {
        // Add to hashmap list (uniqueMap)
        return true;
    }
}
The correct way to de-duplicate a stream involves partitioning the stream by the key, so that all elements containing the same key will be processed by the same worker, and using Flink's managed, keyed state mechanism so that the state is fault-tolerant and re-scalable. Here's a sample implementation:
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.addSource(new EventSource())
        .keyBy(e -> e.key)
        .flatMap(new Deduplicate())
        .print();
    env.execute();
}

public static class Deduplicate extends RichFlatMapFunction<Event, Event> {
    ValueState<Boolean> seen;

    @Override
    public void open(Configuration conf) {
        ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void flatMap(Event event, Collector<Event> out) throws Exception {
        if (seen.value() == null) {
            out.collect(event);
            seen.update(true);
        }
    }
}
This could also be implemented as a RichFilterFunction, btw. But note that if you have an unbounded key space, the state being used will grow indefinitely until you run out of heap, or space on the disk, depending on which of Flink's state backends you choose. If this is an issue, you might want to set up a state retention policy via State Time-to-Live.
Note also that sharing state between different parts of a Flink pipeline isn't possible. You need to turn things inside-out compared to what might seem normal, and bring the event stream to the state, rather than fetching it.
I'm trying to put together my first Akka/Camel application from "scratch" (read, "noob") using the following lib versions:
akka-camel: 2.2.0-RC1
According to all of the documentation I can find (Akka docs, user groups, etc.) all I have to do to consume from a file-based queue is set up my system this way:
Main class:
actorSystem = ActorSystem.create("my-system");
Props props = new Props(Supervisor.class);
ActorRef supervisor = actorSystem.actorOf(props, "supervisor");
Camel camel = CamelExtension.get(actorSystem);
CamelContext camelContext = camel.context();
camelContext.start();
Supervisor class:
import akka.actor.ActorRef;
import akka.actor.Props;
import akka.camel.javaapi.UntypedConsumerActor;
import org.apache.camel.Message;

/**
 * Manages creation and supervision of UploadBatchWorkers.
 */
public class Supervisor extends UntypedConsumerActor {
    @Override
    public String getEndpointUri() {
        return "file:///Users/myhome/queue";
    }

    @Override
    public void preStart() {
        String test = "test";
    }

    @Override
    public void onReceive(Object message) {
        if (message instanceof CamelMessage) {
            // do something
        }
    }
}
My problem is that even though I know the supervisor object is being created and breaks during debugging on the preStart() method's "test" line (not to mention that if I explicitly "tell" it something it processes fine), it does not consume from the defined endpoint, even though I have another application producing messages to the same endpoint.
Any idea what I'm doing wrong?
Ok, the problem was my own fault and is clearly visible in the example code if you look at the Consumer trait from which the UntypedConsumerActor inherits.
This method:
@Override
public void preStart() {
    String test = "test";
}
overrides its parent's preStart() method, right? Well, that parent method is actually the one that registers the consumer with the on-the-fly created endpoint, so while you can override it, you must call super.preStart() or it will not work.
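The effect can be reproduced without Akka in a few lines of plain Java. ConsumerBase below is a hypothetical stand-in for the consumer base class, not the real Akka API; it only assumes, as described above, that the parent's preStart() is what performs the registration.

```java
class PreStartDemo {
    /** Hypothetical base class whose preStart() performs the endpoint registration. */
    static abstract class ConsumerBase {
        private boolean registered;

        void preStart() {
            registered = true; // stands in for "register the consumer with the endpoint"
        }

        boolean isRegistered() {
            return registered;
        }
    }

    /** Overrides preStart() without calling the parent: registration silently never happens. */
    static class BrokenSupervisor extends ConsumerBase {
        @Override
        void preStart() {
            String test = "test";
        }
    }

    /** Calls super.preStart() first, so the parent's registration still runs. */
    static class FixedSupervisor extends ConsumerBase {
        @Override
        void preStart() {
            super.preStart();
            String test = "test";
        }
    }

    public static void main(String[] args) {
        ConsumerBase broken = new BrokenSupervisor();
        broken.preStart();
        System.out.println(broken.isRegistered()); // false

        ConsumerBase fixed = new FixedSupervisor();
        fixed.preStart();
        System.out.println(fixed.isRegistered()); // true
    }
}
```

The broken variant throws no error at all, which matches the symptom in the question: the actor starts and handles direct messages, but never consumes from the endpoint.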
Hope this is useful to someone down the road!
Try changing your instanceof inside of onReceive to this:
if (message instanceof CamelMessage){
//do processing here
}
Where CamelMessage is from package akka.camel. That's what the examples in the akka camel docs are doing.