My aim is to implement dynamic rule based validation of a streaming dataset. My project is using Pyflink. I know that there is a Broadcast pattern in Flink, but didnt find any credible info with regards to the same in Python. Is this feature available in Pyflink ?. If not is there any workaround to implement dynamic rules in Pyflink
Related
I want to create a Custom Apache Flink Sink to AWS Sagemaker Feature store, but there is no documentation for how to create custom sinks on Flink's website. There are also multiple base classes that I can potentially extend (e.g. AsyncSinkBase, RichSinkFunction), so I'm not sure which to use.
I am looking for guidelines regarding how to implement a custom sink (both in general and for my specific use-case). For my specific use-case: Sagemaker Feature Store has a synchronous client with a putRecord call to send records to AWS Sagemaker FS, so I am ideally looking for a way to create a custom sink that would work well with this client. Note: I require at at least once processing guarantees, as Sagemaker FS is DynamoDB (a key-value store) under the hood.
Java Client: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/sagemakerfeaturestoreruntime/AmazonSageMakerFeatureStoreRuntime.html
Example of the putRecord call using the Python client: https://github.com/aws-samples/amazon-sagemaker-feature-store-streaming-aggregation/blob/main/src/lambda/StreamingIngestAggFeatures/lambda_function.py#L31
What I've Found so Far
Some older articles which say to use org.apache.flink.streaming.api.functions.sink.RichSinkFunction and SinkFunction
Some connectors using classes in org.apache.flink.connector.base.sink.writer (e.g. AsyncSinkWriter, AsyncSinkBase)
This section of the Flink docs says to use the SourceReaderBase from org.apache.flink.connector.base.source.reader when creating custom sources; SourceBaseReader seems to be the equivalent source to the sink classes in the bullet above
Any help/guidance/insights are much appreciated, thanks.
How about extending RichAsyncFunction ?
you can find similar example here - https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#async-io-api
Hi I am planning to use flink as a backend for my feature where we will show a UI to user to graphically create event patterns for eg: Multiple login failures from the same Ip address.
We will create the flink pattern programmatically using the given criteria by the user in the UI.
Is there any documentation on how to dynamically create the jar file and dynamically submit the job with it to flink cluster?
Is there any best practice for this kind of use case using apache flink?
The other way you can achieve that is that you can have one jar which contains something like an “interpreter” and you will pass to it the definition of your patterns in some format (e.g. json). After that “interpreter” translates this json to Flink’s operators. It is done in such a way in https://github.com/TouK/nussknacker/ Flink’s based execution engine. If you use such an approach you will need to handle redeployment of new definition in your own application.
One straightforward way to achieve this would be to generate a SQL script for each pattern (using MATCH_RECOGNIZE) and then use Ververica Platform's REST API to deploy and manage those scripts: https://docs.ververica.com/user_guide/application_operations/deployments/artifacts.html?highlight=sql#sql-script-artifacts
Flink doesn't provide tooling for automating the creation of JAR files, or submitting them. That's the sort of thing you might use a CI/CD pipeline to do (e.g., github actions).
Disclaimer: I work for Ververica.
In Fink source, there are flink-stream-java and flink-stream-scala modules. Why do we need two modules for flink streaming?
https://github.com/apache/flink/tree/master/flink-streaming-java
https://github.com/apache/flink/tree/master/flink-streaming-scala
Both flink-stream-java and flink-stream-scala provide a similar API to manage Flink Streams ; you only have to use one of them, depending on your language.
Please note that whatever your choice, some dependencies like flink-runtime and flink-clients depend on a version of scala (2.11 or 2.12), because Flink is based on a framework written in scala, Akka.
There is an ongoing effort to remove scala dependency from a higher level API, flink-table (FLINK-11063).
flink-stream-java is the implement of java api for stream. flink-stream-scala is the implement of scala api for stream. So you can find DataStream.java in flink-stream-java, and DataStream.scala in flink-stream-scala.
These two modules will accomplish the same function, but different developers receive different languages, and personal task scala is more suitable for operator description in languages such as big data, flink spark, etc.
Is it possible to use custom uima annotators in Retrieve&Rank service?
How can I upload my custom annotator (packaged as jar file) to the service?
I need to create an entity annotator to discover my custom domain entities.
I don't think there is an obvious straightforward way to use a custom UIMA annotator in R&R.
Possible approaches you could use, if you want to try integrating the two though:
Use a UIMA pipeline to annotate your documents before storing them in R&R, or as you query R&R for them. I've not tried this myself, but I've seen references to this sort of thing - e.g. http://wiki.apache.org/solr/SolrUIMA so there might be some value in trying this
Use the annotations from your UIMA pipeline to generate additional feature scores that the ranker you train can include in it's training. For example, if your annotator detects the presence or absence of a particular custom domain entity, it could turn this into a score that contributes to the feature scores for a search result. For an example of contributing custom feature scorers to R&R, see https://github.com/watson-developer-cloud/answer-retrieval
In HPC we use often SLURM as workload manager. I would like to know if camel or jbpm-camel can to communicate to SLURM ?
I found apache spark but nothing about slurm.
I am interested by any documentation on this subjects.
Thanks for your help
This question is more about-- what API/RPC options are available for SLURM? If it is completely proprietary, than there probably isn't an available Camel component, but one could be put together using one of the low-level socket camel components, such as mina / netty / etc.