Using SourceFunction and SinkFunction in PyFlink - apache-flink

I am new to PyFlink. I have done the official training exercise in Java: https://github.com/apache/flink-training
However, the project I am working on must use Python as the programming language. I want to know if it is possible to write a data generator using the "SourceFunction". In older PyFlink versions this was possible using Jython: https://nightlies.apache.org/flink/flink-docs-release-1.7/dev/stream/python.html#streaming-program-example
In newer examples the data stream is built from a finite collection, which is never extended. I have not found any example of a data generator in PyFlink comparable to, e.g., https://github.com/apache/flink-training/blob/master/common/src/main/java/org/apache/flink/training/exercises/common/sources/TaxiRideGenerator.java
I am not sure what functionality the SourceFunction and SinkFunction interfaces provide in Python. Can they be used directly, or only in combination with other pipelines or jar files? It looks like the methods "run()" and "cancel()" are not implemented, so they cannot be used like some other classes, by simply overriding those methods.
If it cannot be used in Python, are there other ways to use it? A simple example would help.
If it is not possible to use it, are there other ways to write a data generator in OOP style? Take this example: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/datastream_tutorial/ There the split() method is used to separate the stream. Basically, I want to do the same thing with an extra class that keeps extending the stream, as the Java TaxiRide example does via "ctx.collect()". I am trying to avoid using Java, another framework for the pipeline, or Jython. It would be nice to get a short code example, but I appreciate any tips and advice.
I tried to use SourceFunction directly, but as already mentioned, I think this is the wrong approach; it results in the following error: AttributeError: 'DataGenerator' object has no attribute '_get_object_id'
class DataGenerator(SourceFunction):

    def __init__(self):
        super().__init__(self)
        self._num_iters = 1000
        self._running = True

    def run(self, ctx):
        counter = 0
        while self._running and counter < self._num_iters:
            ctx.collect('Hello World')
            counter += 1

    def cancel(self):
        self._running = False

Solution:
After looking at some older code that uses the SourceFunction and SinkFunction classes, I came to a solution. In that code a Kafka connector written in Java is wrapped, and the Python code can be taken as an example of how to use PyFlink's SourceFunction and SinkFunction with a Java class.
I have only written an example for the SourceFunction:
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream import SourceFunction
from pyflink.java_gateway import get_gateway

class TaxiRideGenerator(SourceFunction):

    def __init__(self):
        java_src_class = get_gateway().jvm.org.apache.flink.training.exercises.common.sources.TaxiRideGenerator
        java_src_obj = java_src_class()
        super(TaxiRideGenerator, self).__init__(java_src_obj)

def show(ds, env):
    # this is just a little helper to show the output of the pipeline
    ds.print()
    env.execute()

def streaming():
    # arm the flink ExecutionEnvironment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)
    taxi_src = TaxiRideGenerator()
    ds = env.add_source(taxi_src)
    show(ds, env)

if __name__ == "__main__":
    streaming()
The second line in the class __init__ was hard to find: I had expected the first line to already return an object, but it only returns the Java class, which then has to be instantiated.
You have to create a jar file after building the flink-training project.
I changed into the build output directory until I could see the folder "org":
$ cd flink-training/flink-training/common/build/classes/java/main
flink-training/common/build/classes/java/main$ ls
org
flink-training/common/build/classes/java/main$ jar cvf flink-training.jar org/apache/flink/training/exercises/common/**/*.class
Copy the jar file to the pyflink/lib folder, which is normally inside your Python environment, e.g. flinkenv/lib/python3.8/site-packages/pyflink/lib, and then start the script.
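As an alternative to copying the jar into pyflink/lib, newer PyFlink versions also let you register the jar from the script itself via add_jars. Here is a minimal sketch, assuming add_jars is available in your version and the jar built above lies at /path/to/flink-training.jar (a placeholder path, adjust it to your build output):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# add_jars expects file:// URLs; the path below is only a placeholder
env.add_jars("file:///path/to/flink-training.jar")

With this approach the jar is shipped together with the job, so nothing needs to be copied into the environment's site-packages folder.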

Related

Does Flink Python API support gauge metric?

I'm using PyFlink for stream processing and have added some metrics to monitor performance.
Here's my code for registering the UDF with metrics. I've installed apache-flink 1.13.0.
from pyflink.table import DataTypes
from pyflink.table.udf import udf, ScalarFunction

class Test(ScalarFunction):

    def __init__(self):
        self.gauge_value = 0

    def open(self, function_context):
        metric_group = function_context.get_metric_group().add_group("test")
        metric_group.gauge("gauge", lambda: self.gauge_value)

    def eval(self, i):
        self.gauge_value = i
        return i

test = udf(
    Test(),
    input_types=[DataTypes.BIGINT()],
    result_type=DataTypes.BIGINT(),
)
No exception is raised in the log, and if I change the type of self.gauge_value to str it does fail, so I believe the line is executed and the gauge is registered.
However, I can never find the gauge in the metrics tab of the Web UI.
I tried adding counters and meters in this test UDF, and they all showed up correctly in the metrics tab.
Is gauge supported in PyFlink? If not, is there any alternative?
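Since counters reportedly do show up in the Web UI, a counter can serve as an interim workaround while the gauge behaviour is unclear. Below is a minimal sketch of the same UDF using a counter; the class name TestWithCounter and the metric name eval_calls are made up for illustration:

from pyflink.table import DataTypes
from pyflink.table.udf import udf, ScalarFunction

class TestWithCounter(ScalarFunction):

    def open(self, function_context):
        metric_group = function_context.get_metric_group().add_group("test")
        # counts how often eval() is called instead of exposing the latest value
        self.counter = metric_group.counter("eval_calls")

    def eval(self, i):
        self.counter.inc()
        return i

test_counter = udf(
    TestWithCounter(),
    input_types=[DataTypes.BIGINT()],
    result_type=DataTypes.BIGINT(),
)

Note that a counter only accumulates, so it is not a drop-in replacement for a gauge that should expose the most recent value.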

How to provide submodule functions through shared libraries while extending Jenkins Pipeline DSL

When extending the DSL, I can extend it, say, this way:
boo {
    var1 = 'var'
}
But I want to extend the DSL this way:
boo.RunBooWithFoo('var1')
Can someone provide an example on how to do this?
You can just create a file vars/boo.groovy in the shared library and put that function there.
def RunBooWithFoo(arg) {
    // your logic
}
Then in the pipeline you can use it this way:
@Library('shared-library-name') _
boo.RunBooWithFoo('var1')

F# WPF Esri Developer License

I have gone to "License your app" and there is only code for C#. I am wondering where to place the license information in my F# WPF application.
Does it go in the app.fs file or a different one?
Thanks
I'm not familiar with this SDK but it should go into the assembly (so exe or dll) file that uses it. It would've been helpful to show the C# code, and not just the link:
Esri.ArcGISRuntime.ArcGISRuntimeEnvironment.ClientId = "mYcLieNTid";
try
{
    Esri.ArcGISRuntime.ArcGISRuntimeEnvironment.Initialize();
}
catch (Exception ex)
{
    Console.WriteLine("Unable to initialize the ArcGIS Runtime with the client ID provided: " + ex.Message);
}
From the recently much improved F# docs, try/with is the equivalent exception-handling mechanism in F#:
Esri.ArcGISRuntime.ArcGISRuntimeEnvironment.ClientId <- "mYcLieNTid"
try
    Esri.ArcGISRuntime.ArcGISRuntimeEnvironment.Initialize()
with
| ex -> printfn "Unable to initialize the ArcGIS Runtime with the client ID provided: %A" ex.Message
Additional info: if you have time, please take a look at F# for fun and profit and also Modules and Classes; you can also search these topics on SO. Here are some comments:
F# is sensitive to the file order in the project; please make sure that you put your module above the file that is open-ing it.
You generally #load *.fsx scripts; you can, but don't need to, do this for *.fs files, as those are assumed to be built into an assembly. You can just say open File1, assuming you have a File1.fs and inside it a module File1; then, if inside the File1 module you have let x = 5, you can say File1.x to access it.
You don't need to put a module into a separate file. You can just place it into the namespace you have your SDK in (maybe App.fs).
The easiest would be to actually put this code inside your main function, under [<EntryPoint>].

Scala: File object for testing (Resource?)

I need to pass a File to a class for unit tests. The class requires a File specifically, and can't be modified - it just reads it. (So I don't want to try to mock out that class or modify it - just pass it the File it needs.)
What's the recommended way to do this?
I put the file in src/test/resources/... and just passed that entire path in, and, since the test is run in the project root dir, this works. But this seems quite wrong.
UPDATE: To emphasize - the class needs a File object, not an InputStream or anything else.
this.getClass().getResourceAsStream(fileName) is the recommended way to read files from resources.
If you need to look up something on the classpath and have it returned as a File, try this:
import java.net.URL
import java.io.File

object Test extends App {
  val fileUrl: URL = getClass.getResource("Test.class")
  val file: File = new File(fileUrl.toURI())
  println(s"File Path: ${file.getCanonicalPath}")
}
It more or less uses all Java classes, but it works. In my example, the Test object is finding its own compiled classfile, but you can obviously change it to whatever you want.
The JVM doesn't have a concept of getting a classpath resource as a File because resources don't necessarily correspond to files. (Often they're contained in jars.)
Your test could start by getting an InputStream and copying it into a temporary file.

Hadoop Map Whole File in Java

I am trying to use Hadoop in Java with multiple input files. At the moment I have two files, a big one to process and a smaller one that serves as a sort of index.
My problem is that I need to keep the whole index file unsplit while the big file is distributed to each mapper. Is there any way provided by the Hadoop API to do such a thing?
In case I have not expressed myself correctly, here is a link to a picture that represents what I am trying to achieve: picture
Update:
Following the instructions provided by Santiago, I am now able to insert a file (or the URI, at least) from Amazon's S3 into the distributed cache like this:
job.addCacheFile(new Path("s3://myBucket/input/index.txt").toUri());
However, when the mapper tries to read it, a 'file not found' exception occurs, which seems odd to me. I have checked the S3 location and everything seems to be fine. I have used other S3 locations for the input and output files.
Error (note the single slash after the s3:)
FileNotFoundException: s3:/myBucket/input/index.txt (No such file or directory)
The following is the code I use to read the file from the distributed cache:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader(cacheFile[0].toString()));
while ((line = br.readLine()) != null) {
    //Do stuff
}
I am using Amazon's EMR, S3 and the version 2.4.0 of Hadoop.
As mentioned above, add your index file to the Distributed Cache and then access it in your mapper. Behind the scenes, the Hadoop framework will ensure that the index file is sent to all the task trackers before any task is executed and that it is available for your processing. In this case, the data is transferred only once and will be available for all the tasks related to your job.
However, instead of adding the index file to the Distributed Cache programmatically in your driver code, have your driver class implement the Tool interface and override its run method (launching it via ToolRunner). This gives you the flexibility of passing the index file to the Distributed Cache through the command line while submitting the job.
If you are using ToolRunner, you can add files to the Distributed Cache directly from the command line when you run the job. There is no need to copy the file to HDFS first. Use the -files option to add files:
hadoop jar yourjarname.jar YourDriverClassName -files cachefile1,cachefile2,cachefile3,...
You can access the files in your Mapper or Reducer code as below:
File f1 = new File("cachefile1");
File f2 = new File("cachefile2");
File f3 = new File("cachefile3");
You could push the index file to the distributed cache, and it will be copied to the nodes before the mapper is executed.
See this SO thread.
Here's what helped me solve the problem.
Since I am using Amazon's EMR with S3, I needed to change the syntax a bit, as stated on the following site.
It was necessary to add the name the system will use to read the file from the cache, as follows:
job.addCacheFile(new URI("s3://myBucket/input/index.txt" + "#index.txt"));
This way, the program understands that the file introduced into the cache is named just index.txt. I also needed to change how the file is read from the cache. Instead of reading the entire path stored in the distributed cache, only the filename has to be used, as follows:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader("index.txt"));
while ((line = br.readLine()) != null) {
    //Do stuff
}
