I implemented a custom ScalarFunction class and would like to initialize a HashMap that reads data from a database in the open() function, but it always gets stuck there. Is this the right way to use the open() function? And how many times is open() invoked: just once, or as many times as eval()?
My sample code is as follows:
public class GenNameUDF extends ScalarFunction {
    @Override
    public void open(FunctionContext context) throws Exception {
        super.open(context);
        CommonClass.map = initMap(); // here we read data from the db
    }

    public String eval(String pubIp) {
        return todo();
    }
}
The open() method is invoked only once per parallel instance of the function, when it is initialized, not once per eval() call, so it is the right place for this kind of setup.
If the amount of data in the db is small, you can use broadcast to access the db only once.
Otherwise, you can build the connection in the open() method and access the db when needed; a sketch follows.
The -yt $file option is also helpful.
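A minimal sketch of the connection-in-open variant; the JDBC URL, table, and column names are placeholders I have assumed, not from the original code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;

public class GenNameUDF extends ScalarFunction {

    private transient Connection connection;
    private transient PreparedStatement lookup;

    @Override
    public void open(FunctionContext context) throws Exception {
        super.open(context);
        // Runs once per parallel instance, not once per eval() call.
        connection = DriverManager.getConnection("jdbc:postgresql://localhost:5432/mydb?user=xxx&password=xxx");
        lookup = connection.prepareStatement("SELECT name FROM ip_names WHERE pub_ip = ?");
    }

    public String eval(String pubIp) {
        try {
            lookup.setString(1, pubIp);
            try (ResultSet rs = lookup.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        } catch (SQLException e) {
            throw new RuntimeException("lookup failed for " + pubIp, e);
        }
    }

    @Override
    public void close() throws Exception {
        // Release JDBC resources when the task shuts down.
        if (lookup != null) lookup.close();
        if (connection != null) connection.close();
        super.close();
    }
}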
I want to use a non-serializable object in stream.map() like this:
stream.map { i =>
val obj = new SomeUnserializableClass()
obj.doSomething(i)
}
This is very inefficient, because it creates a new SomeUnserializableClass instance for every element. Actually, it could be created just once in each worker.
In Spark, I can use mapPartitions to do this, but I don't know how in the Flink streaming API.
If you are dealing with a non-serializable class, what I recommend is to create a RichFunction, in your case a RichMapFunction.
A rich operator in Flink has an open() method that is executed in the task manager just once, as an initializer.
So the trick is to make your field transient and instantiate it in your open() method.
Check below example:
public class NonSerializableFieldMapFunction extends RichMapFunction<Object, Object> {
    transient SomeUnserializableClass someUnserializableClass;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.someUnserializableClass = new SomeUnserializableClass();
    }

    @Override
    public Object map(Object o) throws Exception {
        return someUnserializableClass.doSomething(o);
    }
}
Then your code will look like:
stream.map(new NonSerializableFieldMapFunction())
P.S.: I'm using Java syntax; please adapt it to Scala.
I am using Apache Flink to perform analytics on streaming data.
I am using a dependency whose object takes more than 10 seconds to create, as it reads several files present in HDFS before initialisation.
If I initialise the object in the open method I get a timeout exception, and if I do it in the constructor of a sink/flatmap, I get a serialisation exception.
Currently I am using a static block to initialise the object in another class, calling Preconditions.checkNotNull(DependencyWrap.mGenerator) in the main file, and then it works when used in a flatmap or sink.
Is there a way to create a non-serializable dependency's object, which might take more than 10 seconds to initialise, in Flink's flatmap or sink?
public class DependencyWrap {
    static MGenerator mGenerator;
    static {
        final String configStr = "{}";
        final Config config = new Gson().fromJson(configStr, Config.class);
        mGenerator = new MGenerator(config);
    }
}
public class MyStreaming {
    public static void main(String[] args) throws Exception {
        Preconditions.checkNotNull(DependencyWrap.mGenerator);
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        ...
        input.flatMap(new RichFlatMapFunction<Tuple2<String, Object>, List<String>>() {
            @Override
            public void open(Configuration parameters) {
            }

            @Override
            public void flatMap(Tuple2<String, Object> value, Collector<List<String>> out) throws Exception {
                out.collect(DependencyWrap.mGenerator.generateMyResult(value.f0, value.f1));
            }
        });
    }
}
Also, please correct me if I am wrong about the question.
Doing it in the open method is 100% the right way to do it. Is Flink giving you a timeout exception, or the object?
As a last-ditch method, you could wrap your object in a class that contains both the object and its JSON string or Config (is Config serializable?), with the object marked transient, and then override the readObject/writeObject methods to call the constructor. If the mGenerator object itself is stateless (and you'll have other problems if it's not), the serialization code should get called only once when jobs are distributed to task managers.
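A sketch of that wrap-and-rebuild idea (the wrapper class and its names are hypothetical, and it assumes MGenerator can be rebuilt from its Config alone; only readObject needs overriding here, since the default write already serializes the string):

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;
import com.google.gson.Gson;

public class MGeneratorWrapper implements Serializable {

    // Only the config string travels over the wire; the heavy object is transient.
    private final String configJson;
    private transient MGenerator generator;

    public MGeneratorWrapper(String configJson) {
        this.configJson = configJson;
        this.generator = build(configJson);
    }

    private static MGenerator build(String json) {
        return new MGenerator(new Gson().fromJson(json, Config.class));
    }

    // Runs on deserialization in the task manager: rebuilds the transient field.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.generator = build(configJson);
    }

    public MGenerator get() {
        return generator;
    }
}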
Using open is usually the right place to load external lookup sources. The timeout is a bit odd; maybe there is a configuration around it.
However, if the data is huge, using a static loader (either a static class, as you did, or a singleton) has the benefit that you only need to load it once for all parallel instances of the task on the same task manager. Hence, you save memory and CPU time. This is especially true for you, as you use the same data structure in two separate tasks. Further, the static loader can be lazily initialized when it is used for the first time, to avoid the timeout in open, as in the sketch below.
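A sketch of such a lazy loader (class and method names are assumptions; double-checked locking keeps the expensive construction to once per JVM):

import com.google.gson.Gson;

public final class MGeneratorHolder {

    private static volatile MGenerator instance;

    private MGeneratorHolder() {}

    public static MGenerator get() {
        if (instance == null) {
            synchronized (MGeneratorHolder.class) {
                if (instance == null) {
                    // Expensive: reads several files from HDFS, > 10 s.
                    instance = new MGenerator(new Gson().fromJson("{}", Config.class));
                }
            }
        }
        return instance;
    }
}

The flatmap then calls MGeneratorHolder.get() inside flatMap(...), so the first element triggers the load and open() stays fast.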
The clear downside of this approach is that the testability of your code suffers. There are some ways around that, which I could expand if there is interest.
I don't see a benefit of using the proxy serializer pattern. It's unnecessarily complex (custom serialization in Java) and offers little benefit.
I would like to know why I really need to create my own RichSinkFunction or use JDBCOutputFormat to connect to the database, instead of just creating my connection, performing the query, and closing the connection with the traditional PostgreSQL driver inside my SinkFunction.
I found many articles telling me to do that, but they do not explain why. What is the difference?
Code example using JDBCOutputFormat:
JDBCOutputFormat jdbcOutput = JDBCOutputFormat.buildJDBCOutputFormat()
.setDrivername("org.postgresql.Driver")
.setDBUrl("jdbc:postgresql://localhost:1234/test?user=xxx&password=xxx")
.setQuery(query)
.setSqlTypes(new int[] { Types.VARCHAR, Types.VARCHAR, Types.VARCHAR }) //set the types
.finish();
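For completeness, the format is then attached to a stream; a minimal sketch, assuming a DataStream<Row> named rows whose fields match the query parameters:

rows.writeUsingOutputFormat(jdbcOutput);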
Code example implementing my own RichSinkFunction:
public class RichCaseSink extends RichSinkFunction<Case> {

    private static final String UPSERT_CASE = "INSERT INTO public.cases (caseid, tracehash) "
        + "VALUES (?, ?) "
        + "ON CONFLICT (caseid) DO UPDATE SET "
        + " tracehash=?";

    private PreparedStatement statement;

    @Override
    public void invoke(Case aCase) throws Exception {
        statement.setString(1, aCase.getId());
        statement.setString(2, aCase.getTraceHash());
        statement.setString(3, aCase.getTraceHash());
        statement.addBatch();
        statement.executeBatch();
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection connection =
            DriverManager.getConnection("jdbc:postgresql://localhost:5432/casedb?user=signavio&password=signavio");
        statement = connection.prepareStatement(UPSERT_CASE);
    }
}
Why can't I just use the PostgreSQL driver directly, like this?
public class Storable implements SinkFunction<Activity> {
    @Override
    public void invoke(Activity activity) throws Exception {
        Class.forName("org.postgresql.Driver");
        try (Connection connection =
                DriverManager.getConnection("jdbc:postgresql://localhost:5432/casedb?user=signavio&password=signavio")) {
            PreparedStatement statement = connection.prepareStatement(UPSERT_CASE);
            // Perform the query
            // close connection...
        }
    }
}
Does someone know the technical answer to the best practice in Flink?
Does implementing RichSinkFunction or using JDBCOutputFormat do something special?
Thank you in advance.
Well, you can use your own SinkFunction that simply opens the connection and writes the data inside invoke(), and it will work in general. But its performance will be very, very poor in most cases.
The actual difference between the first example and the second one is that in the RichSinkFunction you use the open() method to open the connection and prepare the statement. That open() method is invoked only once, when the function is initialized. In the second example you open the connection and prepare the statement inside the invoke() method, which is invoked for every element of the input DataStream, so you actually open a new connection for every element in the stream.
Creating a database connection is an expensive thing to do, and it will surely have terrible performance drawbacks. A full sketch of the recommended lifecycle follows.
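To make the lifecycle explicit, here is a minimal sketch of the open-once pattern; it mirrors the RichSinkFunction example above but keeps the connection in a field and adds the close() method that example omits:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class CaseSink extends RichSinkFunction<Case> {

    private transient Connection connection;
    private transient PreparedStatement statement;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Invoked once per parallel sink instance.
        connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/casedb?user=signavio&password=signavio");
        statement = connection.prepareStatement(
                "INSERT INTO public.cases (caseid, tracehash) VALUES (?, ?) "
                        + "ON CONFLICT (caseid) DO UPDATE SET tracehash=?");
    }

    @Override
    public void invoke(Case aCase) throws Exception {
        // Invoked once per element; the connection is reused.
        statement.setString(1, aCase.getId());
        statement.setString(2, aCase.getTraceHash());
        statement.setString(3, aCase.getTraceHash());
        statement.executeUpdate();
    }

    @Override
    public void close() throws Exception {
        // Invoked once on shutdown; release JDBC resources here.
        if (statement != null) statement.close();
        if (connection != null) connection.close();
    }
}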
I have to work with an Oracle database using the old database driver (ora_logon), which is not supported by CakePHP. I can't use the oci driver instead.
Right now I do the following:
Every method of every model connects to the database and retrieves data.
class SomeClass extends Model {
    public function getA() {
        if ($conn = ora_logon("username", "password")) {
            //make the query
            //retrieve data
            //put data in array and return the array
        }
    }

    public function getB() {
        if ($conn = ora_logon("username", "password")) {
            //make the query
            //retrieve data
            //put data in array and return the array
        }
    }
}
I know that this is not the best way to go.
How could I let CakePHP manage opening and closing the connection to the database, and have the models only retrieve data? I'm not interested in any database abstraction layer.
I would think you could just make your own OracleBehavior. Each model could use this behavior, and in it you can overwrite or extend the Model's find() behavior to build a traditional Oracle query and run it (I don't know much about Oracle).
Then, in your behavior's beforeFind() you can open your connection, and in your behavior's afterFind() you can close the database connection.
That way, every time before a query is run, the connection is opened automatically, and after every find it is closed. You can do the same with beforeSave() and afterSave(), and beforeDelete() and afterDelete(). (You'll likely want to create a single connect() method and a disconnect() method in the behavior, so you don't have duplicate code in each beforeX() method.)
Do you really need to extend a Cake Model class?
class SomeClass extends Model {
    private $conn;

    public function __construct() {
        parent::__construct();
        $this->conn = ora_logon("username", "password");
        if (!$this->conn)
            throw new Exception();
    }

    public function getA() {
        //Some code
    }
}
SomeController:
App::uses('SomeClass', 'Model');

public function action() {
    $data = array();
    $error = null;
    try {
        $myDb = new SomeClass();
        $data = $myDb->getA();
    } catch (Exception $e) {
        $error = 'Cannot connect to database';
    }
    $this->set(compact('data', 'error'));
}
I'm trying to put together my first Akka/Camel application from "scratch" (read, "noob") using the following lib versions:
akka-camel: 2.2.0-RC1
According to all of the documentation I can find (Akka docs, user groups, etc.), all I have to do to consume from a file-based queue is to set up my system this way:
Main class:
ActorSystem actorSystem = ActorSystem.create("my-system");
Props props = new Props(Supervisor.class);
ActorRef supervisor = actorSystem.actorOf(props, "supervisor");
Camel camel = CamelExtension.get(actorSystem);
CamelContext camelContext = camel.context();
camelContext.start();
Supervisor class:
import akka.actor.ActorRef;
import akka.actor.Props;
import akka.camel.CamelMessage;
import akka.camel.javaapi.UntypedConsumerActor;
import org.apache.camel.Message;

/**
 * Manages creation and supervision of UploadBatchWorkers.
 */
public class Supervisor extends UntypedConsumerActor {
    @Override
    public String getEndpointUri() {
        return "file:///Users/myhome/queue";
    }

    @Override
    public void preStart() {
        String test = "test";
    }

    @Override
    public void onReceive(Object message) {
        if (message instanceof CamelMessage) {
            // do something
        }
    }
}
My problem is that even though I know the supervisor object is being created (it breaks during debugging on the preStart() method's "test" line, and if I explicitly "tell" it something, it processes fine), it does not consume from the defined endpoint, even though another application is producing messages to that same endpoint.
Any idea what I'm doing wrong?
Ok, the problem was my own fault and is clearly visible in the example code if you look at the Consumer trait from which the UntypedConsumerActor inherits.
This method:
@Override
public void preStart() {
    String test = "test";
}
overrides its parent's preStart() method, right? Well, that parent method is actually the one that registers the consumer with the on-the-fly created endpoint, so while you can override it, you must call super.preStart() or it will not work.
Hope this is useful to someone down the road!
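For reference, the fixed override is a one-line change (a minimal sketch of the fix described above):

@Override
public void preStart() {
    // The parent's preStart() registers this consumer with the endpoint,
    // so the super call is what actually starts consumption.
    super.preStart();
    String test = "test";
}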
Try changing your instanceof check inside onReceive to this:
if (message instanceof CamelMessage) {
    // do processing here
}
Where CamelMessage is from the package akka.camel. That's what the examples in the Akka Camel docs do.