Using PyFlink with LightGBM - apache-flink

Is it possible to use PyFlink with python machine learning libraries such as LightGBM for a streaming application? Is there any good example for this?

There is no complete example, but you can take a look at Getting Started with Flink Python and then look at how Python UDFs can be used: UDFs in the Table API.
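For illustration, here is a minimal sketch (not from the linked docs) of wrapping a pre-trained LightGBM model in a PyFlink scalar UDF with a recent PyFlink version; the model file path, feature columns, and table name are hypothetical placeholders:

```python
import numpy as np
import lightgbm as lgb
from pyflink.table import EnvironmentSettings, TableEnvironment, DataTypes
from pyflink.table.udf import udf

# Load the trained model once per Python worker process (hypothetical path).
booster = lgb.Booster(model_file="/path/to/model.txt")

@udf(result_type=DataTypes.DOUBLE())
def predict(f1: float, f2: float) -> float:
    # LightGBM expects a 2-D array of feature rows; return the single score.
    return float(booster.predict(np.array([[f1, f2]]))[0])

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.create_temporary_function("predict", predict)

# Apply the UDF to a streaming table, e.g. one backed by a Kafka source:
# scored = t_env.sql_query("SELECT predict(f1, f2) AS score FROM features")
```

Since the model is loaded on the worker, scoring then happens inside the Python UDF for each incoming row of the stream.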

Related

deploy h2o.ai trained learner in snowflake

I have been reading articles suggesting that H2O.ai integrates its ML with Snowflake.
https://www.h2o.ai/resources/solution-brief/integration-of-h2o-driverless-ai-with-snowflake/
If I wanted to export a POJO learner such as a GBM and have it run in Snowflake, is there a clean way to do that? I didn't see any clear directions in the (several) articles I found.
How does that integrate with ML-ops?
One way to integrate models built in H2O.ai is through Snowflake External Functions.
This is documented at https://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/snowflake-integration.html
H2O.ai also has (or will have shortly) support for deploying models into Snowflake Java UDFs; it is described in https://www.h2o.ai/blog/h2o-integrates-with-snowflake-snowpark-java-udfs-how-to-better-leverage-the-snowflake-data-marketplace-and-deploy-in-database/
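As a hedged illustration of the External Functions route (not taken from the H2O docs): once the model is exposed as a Snowflake external function, it can be invoked from SQL like any other function, for example via the Snowflake Python connector. The connection parameters and the function/table names below are hypothetical:

```python
import snowflake.connector

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="***",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()

# h2o_score_ext is assumed to be the external function backed by the H2O model.
cur.execute("SELECT id, h2o_score_ext(feature1, feature2) AS score FROM loans")
for row in cur.fetchall():
    print(row)
```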

Migrating from Google App Engine Mapreduce to Apache Beam

I have been a long-time user of Google App Engine's Mapreduce library for processing data in the Google Datastore. Google no longer supports it, and it doesn't work at all in Python 3. I'm trying to migrate our older Mapreduce jobs to Google's Dataflow / Apache Beam runner, but the official documentation is awful: it just describes Apache Beam and doesn't explain how to migrate.
In particular, my questions are these:
In Mapreduce, jobs ran on your existing deployed application. In Beam, however, you have to create and deploy a custom Docker image to build the environment for Dataflow. Is this right?
To create a new job template in Mapreduce, you just edit a YAML file and deploy it. To create one in Apache Beam, you need to write custom runner code, deploy a template file to Google Cloud Storage, and link it up with the Docker image. Is this right?
Is the above accurate? If so, is it generally the case that working with Dataflow is much more difficult than Mapreduce? Are there any libraries or tips for making this easier?
In technical terms that's what is happening, but unless you have some specific advanced use cases, you won't need to set up any custom Docker images manually. Dataflow does some work in the background to package your user code and dependencies into a container so that it can execute them on its VMs.
In Dataflow, writing a job template mainly requires writing some pipeline code in your chosen language (Java or Python), and possibly writing some metadata. Once your code is written, creating and staging the template itself isn't much different than running a normal Dataflow job. There's a page documenting the process.
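For reference, here is a minimal sketch of what such pipeline code can look like in Python; the project, region, bucket, and file paths are hypothetical placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical Dataflow settings; with runner="DirectRunner" the same
# pipeline runs locally for testing.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "CountRows" >> beam.combiners.Count.Globally()
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/count")
    )
```

Turning a pipeline like this into a reusable template is then mostly a matter of staging it with the template commands described on that page.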
I agree the page on Mapreduce to Beam migration is very sparse and unhelpful, although I think I understand why that is. Migrating from Mapreduce to Beam isn't a straightforward 1:1 migration where only the syntax changes. It's a different pipeline model and most likely will require some level of rewriting your code for the migration. A migration guide that fully covered everything would end up repeating most of the existing documentation.
Since it sounds like most of your questions are around setting up and executing Beam pipelines, I encourage you to begin with the Dataflow quickstart in your chosen language. It won't teach you how to write pipelines, but will teach you how to set up your environment to write and run pipelines. There are links in the quickstarts which direct you to Apache Beam tutorials that teach you the Beam API and how to write your own pipelines, and those will be useful for rewriting your Mapreduce code in Beam.

How does a PyFlink job call external jar?

I want to call my Java interfaces in a jar file from a PyFlink job. I couldn't find a solution in the official documentation.
It looks to me like support for this was not included in Flink 1.9, but is ongoing work. See FLIP-58. FLIP-78 and FLIP-88 may also be of interest. Note that most of these improvements will be included in the upcoming Flink 1.10 release.
You can use the Python Table API to register a Java user-defined function if that satisfies your need. The relevant method is register_java_function on the TableEnvironment.
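A minimal sketch of what that can look like, assuming a jar containing a ScalarFunction subclass (the jar path and the class name com.example.MyUpper are hypothetical, and the exact setup may differ slightly between PyFlink versions):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Ship the jar with the job so the Java class is on the classpath.
t_env.get_config().get_configuration().set_string(
    "pipeline.jars", "file:///path/to/my-udfs.jar")

# Register the Java UDF so it can be used from SQL / the Table API.
t_env.register_java_function("my_upper", "com.example.MyUpper")

# e.g. result = t_env.sql_query("SELECT my_upper(name) FROM my_table")
```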

why do we have flink-streaming-java and flink-streaming-scala modules in flink source code

In the Flink source, there are flink-streaming-java and flink-streaming-scala modules. Why do we need two modules for Flink streaming?
https://github.com/apache/flink/tree/master/flink-streaming-java
https://github.com/apache/flink/tree/master/flink-streaming-scala
Both flink-streaming-java and flink-streaming-scala provide a similar API to manage Flink streams; you only have to use one of them, depending on your language.
Please note that whatever your choice, some dependencies like flink-runtime and flink-clients depend on a version of Scala (2.11 or 2.12), because Flink is built on Akka, a framework written in Scala.
There is an ongoing effort to remove the Scala dependency from the higher-level API, flink-table (FLINK-11063).
flink-streaming-java is the implementation of the Java API for streams, and flink-streaming-scala is the implementation of the Scala API for streams. So you can find DataStream.java in flink-streaming-java and DataStream.scala in flink-streaming-scala.
The two modules accomplish the same thing, but different developers prefer different languages; personally, I find Scala better suited to describing operators in big-data frameworks such as Flink and Spark.

Continuous Training in Sagemaker

I am trying out Amazon SageMaker, and I haven't figured out how to do continuous training.
For example, if I have a CSV file in S3, I want to retrain each time the CSV file is updated.
I know I can go back to the notebook and re-run the whole thing to make this happen,
but I am looking for an automated way, with some Python scripts or a Lambda function triggered by S3 events, etc.
You can use the boto3 SDK for Python to start a training job from Lambda; then you need to trigger the Lambda when the CSV is updated.
http://boto3.readthedocs.io/en/latest/reference/services/sagemaker.html
Example Python code:
https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html
Addition: you don't have to use Lambda; you can simply run (or cron) the Python script on any instance that has Python and the AWS SDK installed.
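A minimal sketch of such a Lambda handler, triggered by an S3 "object created" event, that starts a SageMaker training job with boto3; the role ARN, training image URI, and instance settings are hypothetical placeholders you would replace with your own:

```python
import time
import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    # The S3 event tells us which object changed.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    job_name = "csv-retrain-{}".format(int(time.time()))
    sm.create_training_job(
        TrainingJobName=job_name,
        RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical
        AlgorithmSpecification={
            # Replace with the algorithm/container image URI for your region.
            "TrainingImage": "<training-image-uri>",
            "TrainingInputMode": "File",
        },
        InputDataConfig=[{
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://{}/{}".format(bucket, key),
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        OutputDataConfig={"S3OutputPath": "s3://{}/model-output/".format(bucket)},
        ResourceConfig={"InstanceType": "ml.m5.large",
                        "InstanceCount": 1,
                        "VolumeSizeInGB": 10},
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )
    return {"started": job_name}
```

Wiring the S3 bucket notification to this Lambda (on object-created events for the CSV key) completes the automation.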
There are a couple of examples of how to accomplish this in the aws-samples GitHub organization.
The serverless-sagemaker-orchestration example sounds most similar to the use case you are describing. This example walks you through how to continuously train a SageMaker linear regression model for housing price predictions on new CSV data that is added daily to an S3 bucket using the built-in LinearLearner algorithm, orchestrated with Amazon CloudWatch Events, AWS Step Functions, and AWS Lambda.
There is also the similar aws-sagemaker-build example but it might be more difficult to follow currently if you are looking for detailed instructions.
Hope this helps!
