I've managed to plug in the GCP PubSub dependency into the Flink Statefun JAR and then build the Docker image.
I've added the following to the pom.xml:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-gcp-pubsub</artifactId>
    <version>1.16.0</version>
    <scope>test</scope>
</dependency>
It's not too clear how I now specify my PubSub ingress and egress in the module.yaml that we use with the StateFun image.
https://nightlies.apache.org/flink/flink-statefun-docs-master/docs/modules/overview/
For example, for Kafka you use:
kind: io.statefun.kafka.v1/egress
spec:
  id: com.example/my-egress
  address: kafka-broker:9092
  deliverySemantic:
    type: exactly-once
    transactionTimeout: 15min
I can see the official connectors have a Kind constant in the Java code that you use to reference them within your module.yaml, but I can't see in the docs how to reference the Flink connectors you plug into the StateFun image yourself.
GCP PubSub is not officially supported as a standard StateFun I/O component; only Kafka and Kinesis are for now. However, you can come up with your own custom ingress/egress connector relatively easily. Unfortunately you won't be able to add a new yaml-based config item, as the module configurators for Kafka and Kinesis seem to be hard-coded in the runtime. You'll have to do your configuration in code:
Looking at the source/ingress example:
public class ModuleWithSourceSpec implements StatefulFunctionModule {

  @Override
  public void configure(Map<String, String> globalConfiguration, Binder binder) {
    IngressIdentifier<TypedValue> id =
        new IngressIdentifier<>(TypedValue.class, "com.example", "custom-source");
    IngressSpec<TypedValue> spec = new SourceFunctionSpec<>(id, new FlinkSource<>());
    binder.bindIngress(spec);
    binder.bindIngressRouter(id, new CustomRouter());
  }
}
Your goal is going to be to provide the new FlinkSource<>(), which is an org.apache.flink.streaming.api.functions.source.SourceFunction.
You could declare it thus:
SourceFunction source =
    PubSubSource.newBuilder()
        .withDeserializationSchema(new IntegerSerializer())
        .withProjectName(projectName)
        .withSubscriptionName(subscriptionName)
        .withMessageRateLimit(1)
        .build();
You'll also have to come up with a new CustomRouter(), to determine which function instance should handle an event initially. You can take inspiration from here:
public static class GreetingsStateBootstrapDataRouter implements Router<Tuple2<String, Integer>> {

  @Override
  public void route(
      Tuple2<String, Integer> message, Downstream<Tuple2<String, Integer>> downstream) {
    downstream.forward(new Address(GREETER_FUNCTION_TYPE, message.f0), message);
  }
}
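Applied to the TypedValue ingress bound above, a minimal sketch of such a router could look like the following (the function type and the way the target id is derived here are assumptions for illustration, not something prescribed by the SDK):
public class CustomRouter implements Router<TypedValue> {

  // assumed function type; replace with the namespace/name of your own function
  private static final FunctionType GREETER_FUNCTION_TYPE =
      new FunctionType("com.example.fns", "greeter");

  @Override
  public void route(TypedValue message, Downstream<TypedValue> downstream) {
    // derive the target function instance id from the message; a fixed id is used here for brevity
    downstream.forward(new Address(GREETER_FUNCTION_TYPE, "some-id"), message);
  }
}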
The same goes for the sink/egress, except there is no router to provide:
public class ModuleWithSinkSpec implements StatefulFunctionModule {

  @Override
  public void configure(Map<String, String> globalConfiguration, Binder binder) {
    EgressIdentifier<TypedValue> id =
        new EgressIdentifier<>("com.example", "custom-sink", TypedValue.class);
    EgressSpec<TypedValue> spec = new SinkFunctionSpec<>(id, new FlinkSink<>());
    binder.bindEgress(spec);
  }
}
With new FlinkSink<>() replaced by this sink:
SinkFunction sink =
    PubSubSink.newBuilder()
        .withSerializationSchema(new IntegerSerializer())
        .withProjectName(projectName)
        .withTopicName(outputTopicName)
        .build();
You would then use it like so, in the egress case:
public class GreeterFn implements StatefulFunction {

  static final TypeName TYPE = TypeName.typeNameFromString("com.example.fns/greeter");
  static final TypeName CUSTOM_EGRESS = TypeName.typeNameFromString("com.example/custom-sink");
  static final ValueSpec<Integer> SEEN = ValueSpec.named("seen").withIntType();

  @Override
  public CompletableFuture<Void> apply(Context context, Message message) {
    if (!message.is(User.TYPE)) {
      throw new IllegalStateException("Unknown type");
    }
    User user = message.as(User.TYPE);
    String name = user.getName();

    var storage = context.storage();
    var seen = storage.get(SEEN).orElse(0);
    storage.set(SEEN, seen + 1);

    context.send(
        EgressMessageBuilder.forEgress(CUSTOM_EGRESS)
            .withUtf8Value("Hello " + name + " for the " + seen + "th time!")
            .build());

    return context.done();
  }
}
You'll also have to make your modules known to the runtime via a file in the META-INF/services directory of your jar, typically named after the module SPI interface (org.apache.flink.statefun.sdk.spi.StatefulFunctionModule), listing them like so:
com.example.your.path.ModuleWithSourceSpec
com.example.your.path.ModuleWithSinkSpec
Alternatively, if you prefer annotations, you can use Google AutoService instead of the services file.
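For example, annotating one of the modules above (this assumes the com.google.auto.service:auto-service dependency is on the classpath with annotation processing enabled):
@AutoService(StatefulFunctionModule.class)
public class ModuleWithSourceSpec implements StatefulFunctionModule {
  // same configure(...) method as shown earlier
}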
I hope it helps!
I am curious how to use the User Configuration option in the Flink JobManager UI. Is there any way to expose my application.conf values via the Flink environment so that they are displayed under User Configuration? I did not find much documentation about this User Configuration online.
If someone has any idea about it, let me know.
Thanks.
This section of the UI is populated with the GlobalJobParameters that are set via ExecutionConfig#setGlobalJobParameters.
public static void main(String[] args) throws Exception {
  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

  // assemble a map of values (e.g., from 'args', a file on the classpath or the jar manifest)
  Map<String, String> data = ...

  env.getConfig().setGlobalJobParameters(new MetaData(data));

  // rest of the job
  ...
}
// a trivial wrapper around an existing map
private static class MetaData extends ExecutionConfig.GlobalJobParameters {

  private final Map<String, String> data;

  private MetaData(Map<String, String> data) {
    this.data = data;
  }

  @Override
  public Map<String, String> toMap() {
    return data;
  }
}
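As a side note, if your values come from command-line arguments, Flink's ParameterTool already extends ExecutionConfig.GlobalJobParameters, so a sketch along these lines (assuming --key value style arguments) achieves the same effect without a custom class:
public static void main(String[] args) throws Exception {
  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

  // ParameterTool is a GlobalJobParameters implementation, so its entries
  // show up directly under "User Configuration" in the JobManager UI
  ParameterTool parameters = ParameterTool.fromArgs(args);
  env.getConfig().setGlobalJobParameters(parameters);

  // ... define and execute the rest of the job
}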
We are interested in connecting to a regular Flink streaming application from the new Stateful Functions 🎉, ideally using the Table API. The idea is to consult tables registered in Flink from StateFun. Is this possible, and what is the right way to do it?
My idea so far has been to initialize my table stream in some main function and register a stateful function provider to connect to the table:
@AutoService(StatefulFunctionModule.class)
public class Module implements StatefulFunctionModule {

  @Override
  public void configure(Map<String, String> globalConfiguration, Binder binder) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

    // ingest a DataStream from an external source
    DataStream<Tuple3<Long, String, Integer>> ds = env.addSource(...);

    // SQL query with an inlined (unregistered) table
    Table myTable = tableEnv.fromDataStream(ds, "user, product, amount");
    tableEnv.createTemporaryView("my_table", myTable);

    TableFunctionProvider tableProvider = new TableFunctionProvider();
    binder.bindFunctionProvider(FnEnrichmentCallback.TYPE, tableProvider);

    // continue registering my other messages
    // ...
  }
}
The stateful function provider would return a FnTableQuery which simply queries the table whenever it receives a message:
public class TableFunctionProvider implements StatefulFunctionProvider {

  @Override
  public StatefulFunction functionOfType(FunctionType type) {
    return new FnTableQuery();
  }
}
The query function object would then operate as an actor for every established process, and simply query the table when invoked:
public class FnTableQuery extends StatefulMatchFunction {

  static final FunctionType TYPE = new FunctionType(Identifiers.NAMESPACE, "my-table");

  private Table myTable;

  @Override
  public void configure(MatchBinder binder) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
    myTable = tableEnv.from("my_table");
    binder.otherwise(this::catchAll);
  }

  private void catchAll(Context context, Object message) {
    context.send(FnEnrichmentCallback.TYPE, myTable.select("max(amount)").toString(), message);
  }
}
I apologize in advance if this approach doesn't make sense, because I don't know if:
Flink and Statefun applications can work together outside the realm of sources/sinks, especially since this particular function is stateless and the table is stateful
We can query Flink tables like this, I have only queried them as an intermediate object to send to a sink or datastream
It makes sense to initialize things in Module.configure, and if both the stateful function provider and its match function are called once per parallel worker
The Apache Flink community does have it in mind to support Flink DataStreams as StateFun ingresses / egresses in the future.
What this would mean is that you could take the result streams from the Flink Table API / Flink CEP / DataStream API etc. and invoke functions with the events in those streams.
I want to use a non-serializable object in stream.map() like this:
stream.map { i =>
  val obj = new SomeUnserializableClass()
  obj.doSomething(i)
}
This is very inefficient, because I create many SomeUnserializableClass instances; it really only needs to be created once per worker.
In Spark I can use mapPartitions to do this, but I don't know how to do it with the Flink streaming API.
If you are dealing with a non-serializable class, what I recommend is to create a RichFunction, in your case a RichMapFunction.
A rich operator in Flink has an open method that is executed on the task manager just once, as an initializer.
So the trick is to make your field transient and instantiate it in your open method.
Check the example below:
public class NonSerializableFieldMapFunction extends RichMapFunction<Object, Object> {

  transient SomeUnserializableClass someUnserializableClass;

  @Override
  public void open(Configuration parameters) throws Exception {
    super.open(parameters);
    // created once per parallel task instance, never serialized with the function
    this.someUnserializableClass = new SomeUnserializableClass();
  }

  @Override
  public Object map(Object o) throws Exception {
    return someUnserializableClass.doSomething(o);
  }
}
Then your code will look like:
stream.map(new NonSerializableFieldMapFunction())
P.S.: I'm using Java syntax, please adapt it to Scala.
I tried to inject a repository into a changelog with the @Autowired annotation, but it doesn't get injected.
The config uses the Spring application context:
@Bean
public SpringBootMongock mongock(ApplicationContext springContext, MongoClient mongoClient) {
  return new SpringBootMongockBuilder(mongoClient, "yourDbName", "com.package.to.be.scanned.for.changesets")
      .setApplicationContext(springContext)
      .setLockQuickConfig()
      .build();
}
And the changelog:
@ChangeLog(order = "001")
public class MyMigration {

  @Autowired
  private MyRepository repo;

  @ChangeSet(order = "001", id = "someChangeId", author = "testAuthor")
  public void importantWorkToDo(DB db) {
    repo.findAll(); // null pointer here
  }
}
Firstly, notice that if you are using repositories in your changelogs, it's bad practice to use them for writes, as writes won't be covered by the lock mechanism (this feature is coming soon); use them only for reads.
To inject your repository (or any other dependency), you simply need to declare it in your changeSet method signature, like this:
@ChangeLog(order = "001")
public class MyMigration {

  @ChangeSet(order = "001", id = "someChangeId", author = "testAuthor")
  public void importantWorkToDo(MongoTemplate template, MyRepository repo) {
    repo.findAll(); // this should work
  }
}
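To make the injection pattern a bit more concrete, a hypothetical changeSet that reads with the repository and writes through the injected MongoTemplate might look like this (MyDocument and its setMigrated field are assumptions for illustration, not from the original answer):
@ChangeSet(order = "002", id = "anotherChangeId", author = "testAuthor")
public void migrateDocuments(MongoTemplate template, MyRepository repo) {
  // read with the Spring Data repository...
  for (MyDocument doc : repo.findAll()) {
    doc.setMigrated(true); // hypothetical field update
    // ...and perform the write through the injected MongoTemplate
    template.save(doc);
  }
}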
Notice that you should use the latest version (at this moment 3.2.4), and that the DB class is not supported anymore. Please use MongoDatabase or MongoTemplate (preferred).
Documentation for Mongock
We have recently released version 4.0.7.alpha, which among other things allows you to use Spring repositories (and any other custom bean you wish) in your changeSets with no problem. You can insert, update, delete and read, and it will be safely covered by the lock.
The only restriction is that it needs to be an interface, which should be the common case for Spring repositories.
Please take a look at this example.
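For reference, the MyRepository used in the snippets above would typically be a plain Spring Data interface along these lines (MyDocument is an assumed document class):
public interface MyRepository extends MongoRepository<MyDocument, String> {
  // query methods can be derived from method names, e.g.:
  // List<MyDocument> findByMigrated(boolean migrated);
}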
Is it possible to build an application using Spring Data with common code that supports both an RDBMS and NoSQL (MongoDB) as the back-end data store? It should support either one of them at any point in time, and this should be configurable.
I have just pushed a new Spring Data project named spring-data-gremlin, which aims to do exactly this. It uses JPA annotations to map to any Tinkerpop Blueprints graph database (OrientDB, TitanDB, etc.). This means that switching between an RDBMS and a NoSQL graph database should be a matter of configuration for any Spring Data JPA project.
Note: the project is in the early stages of development, and therefore not all JPA annotations are implemented yet.
I don't know for sure about MongoDB, but we currently have projects configured with Spring Data JPA and Spring Data Neo4j simultaneously. I can't think of any obstacle that would prevent this from working with Spring Data JPA and Spring Data MongoDB.
Be aware of transaction management: as far as I know, MongoDB does not support transactions, so any write to both data sources cannot be done as an atomic operation. If this is not an issue, you're good to go.
Our example snippet:
<neo4j:config storeDirectory="${neo4j.storeDirectory}"
              base-package="app.model.neo4j" />
<neo4j:repositories base-package="app.neo4j.repo" />
<tx:annotation-driven transaction-manager="neo4jTransactionManager" />
And Spring Data JPA in a @Configuration-annotated class:
@Configuration
@EnableJpaRepositories(value = "app.dao", entityManagerFactoryRef = "emf", transactionManagerRef = "tm")
@ComponentScan("app")
@EnableTransactionManagement
public class ConfigDao {

  protected final String PROPERTY_DB_MODEL_PACKAGESTOSCAN = "db.model.packagesToScan";
  protected final String PROPERTY_DB_DRIVER_CLASSNAME = "db.driver.className";
  protected final String PROPERTY_DB_URL = "db.url";
  protected final String PROPERTY_DB_USERNAME = "db.username";
  protected final String PROPERTY_DB_PASSWORD = "db.password";
  protected final String PROPERTY_DB_ADDITIONAL_DDL = "hibernate.hbm2ddl.auto";
  protected final String PROPERTY_DB_ADDITIONAL_DIALECT = "hibernate.dialect";
  protected final String PROPERTY_DB_ADDITIONAL_EMF_NAME = "hibernate.ejb.entitymanager_factory_name";

  // resolves the property keys above from the application's property sources
  @Autowired
  private Environment env;

  @Bean
  public DataSource dataSource() {
    DriverManagerDataSource dataSource = new DriverManagerDataSource();
    dataSource.setDriverClassName(env.getProperty(PROPERTY_DB_DRIVER_CLASSNAME));
    dataSource.setUrl(env.getProperty(PROPERTY_DB_URL));
    dataSource.setUsername(env.getProperty(PROPERTY_DB_USERNAME));
    dataSource.setPassword(env.getProperty(PROPERTY_DB_PASSWORD));
    return dataSource;
  }

  @Bean(name = "tm")
  public PlatformTransactionManager transactionManager() {
    JpaTransactionManager transactionManager = new JpaTransactionManager();
    transactionManager.setEntityManagerFactory(entityManagerFactory().getObject());
    return transactionManager;
  }

  @Bean
  public EntityManager entityManager() {
    return entityManagerFactory().getObject().createEntityManager();
  }

  @Bean(name = "emf")
  public LocalContainerEntityManagerFactoryBean entityManagerFactory() {
    LocalContainerEntityManagerFactoryBean em = new LocalContainerEntityManagerFactoryBean();
    em.setDataSource(dataSource());
    em.setPackagesToScan(env.getProperty(PROPERTY_DB_MODEL_PACKAGESTOSCAN));
    JpaVendorAdapter vendorAdapter = new HibernateJpaVendorAdapter();
    em.setJpaVendorAdapter(vendorAdapter);
    em.setJpaProperties(additionalJpaProperties());
    return em;
  }

  @Bean
  protected Properties additionalJpaProperties() {
    Properties properties = new Properties();
    properties.setProperty(PROPERTY_DB_ADDITIONAL_DDL, env.getProperty(PROPERTY_DB_ADDITIONAL_DDL));
    properties.setProperty(PROPERTY_DB_ADDITIONAL_DIALECT, env.getProperty(PROPERTY_DB_ADDITIONAL_DIALECT));
    properties.setProperty(PROPERTY_DB_ADDITIONAL_EMF_NAME, env.getProperty(PROPERTY_DB_ADDITIONAL_EMF_NAME));
    return properties;
  }
}
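For completeness, a hypothetical properties file backing those keys (the concrete values here are placeholders, not part of the original answer) could look like:
db.model.packagesToScan=app.model.jpa
db.driver.className=org.h2.Driver
db.url=jdbc:h2:mem:appdb
db.username=sa
db.password=
hibernate.hbm2ddl.auto=update
hibernate.dialect=org.hibernate.dialect.H2Dialect
hibernate.ejb.entitymanager_factory_name=emf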
Hope it helps.