I am hoping for guidance on how to set --environment_config when running the Beam wordcount.py demo.
It runs fine with DirectRunner. Flink's wordcount also runs fine (i.e., running Flink via flink run).
I would like to run Beam with the Flink runner against a "separate Flink cluster" as described in the Beam documentation. I can't use Docker, so I plan to use --environment_type=PROCESS.
I am using the following inside the python code to set environment_config:
import json
import platform

# build the JSON payload for the --environment_config pipeline option
environment_config = dict()
environment_config['os'] = platform.system().lower()
environment_config['arch'] = platform.machine()
environment_config['command'] = 'ls'
ec = "--environment_config={}".format(json.dumps(environment_config))
Obviously the command is incorrect. When I run this, Flink does receive and successfully process the DataSource sub-tasks, but it eventually times out on the CHAIN MapPartitions.
Could someone provide guidance (or links) as to how to set environment_config? I am running Beam within a Singularity container.
For environment_type=DOCKER, almost everything is taken care of for you, but in PROCESS mode you have to do a lot of the setup yourself. The command you're looking for is sdks/python/container/build/target/launcher/linux_amd64/boot. You will need both that executable (which you can build from source using ./gradlew :sdks:python:container:build) and a Python installation including Beam and its other dependencies on all of your worker machines.
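For example, assuming the Flink job server is listening on its default port 8099 and the boot binary was built from a Beam checkout, a submission might look roughly like this (all paths here are illustrative, not canonical):

# submit wordcount through the portable job server with a PROCESS environment
python -m apache_beam.examples.wordcount \
    --input=/tmp/input.txt \
    --output=/tmp/counts \
    --runner=PortableRunner \
    --job_endpoint=localhost:8099 \
    --environment_type=PROCESS \
    --environment_config='{"command": "/path/to/beam/sdks/python/container/build/target/launcher/linux_amd64/boot"}'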
The best example I know of is here: https://github.com/apache/beam/blob/cbf8a900819c52940a0edd90f59bf6aec55c817a/sdks/python/test-suites/portable/py2/build.gradle#L146-L165
Related
I'm working with the examples provided in 'flink-training' in the GitHub repository here. Specifically, I'm working on the 'ride-cleansing' example.
I've replaced the PrintSinkFunction with a simple FileSink configured as follows:
FileSink<String> fileSink =
        FileSink.forRowFormat(new Path(args[0]),
                        new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(DefaultRollingPolicy.builder()
                        .withRolloverInterval(Duration.ofMinutes(1))
                        .withInactivityInterval(Duration.ofSeconds(30))
                        .withMaxPartSize(512 * 512 * 512)  // 512^3 bytes = 128 MiB
                        .build())
                .build();
When I run this example on my local machine in IntelliJ, the expected directory is created and files are written reflecting the data streamed to the sink.
However, when I run this same example on a Linux box (on Google Colab), the directory is created, but no files are created, regardless of how long I leave it running (I've tried 10+ minutes).
On the Linux container, I'm running the example using the Gradle setup and the following command:
./gradlew :ride-cleansing:runJavaSolution --args="/content/datastream"
On the Windows box, I'm just executing the RideCleansingSolution 'main' with a simple 'Application' run configuration.
What might be different about my setup on the two systems that would decide whether data is written?
It might not work, but if you set up MonoDevelop on whatever *nix you're using and write it all in C# via Xamarin in VS.NET23 it MIGHT work seamlessly across all platforms and arches... but I'm just spitballing here, so ¯\_(ツ)_/¯
I'd like to run an instance of Vespa outside of a container (e.g., Docker). The Docker path is definitely quite convenient and works great, but I would like to go through the process of setting up an instance on macOS by hand and see more of the 'nuts and bolts' of Vespa.
It appears there are nice docs outlining a path to building RPMs for CentOS, etc. Would walking through that process and adapting it to macOS be my best bet?
Unfortunately, running Vespa directly on macOS is not yet supported. I'd suggest instead running a CentOS VM or cloud instance and experimenting there.
This tutorial says we can run Flink via start-local.bat, but Flink 1.1x no longer ships any such .bat files. According to more recent tutorials, you have to run Flink via WSL or Cygwin.
Flink itself runs fine on Windows. The only issue is with the scripts used to manage the cluster and to submit jobs; for those you need some way to run Linux shell scripts.
This needn't have much impact; many Flink developers never install a local cluster anyway. You can develop in your IDE just fine, and then use Docker whenever you want to bring up a local environment that resembles production.
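If you do take the Docker route, bringing up a minimal local session cluster with the official flink image can look roughly like this (the image tag, container names, and network name are just examples):

# one JobManager and one TaskManager on a shared Docker network
docker network create flink-net
docker run -d --name jobmanager --network flink-net -p 8081:8081 \
    --env FLINK_PROPERTIES="jobmanager.rpc.address: jobmanager" \
    flink:1.16-scala_2.12 jobmanager
docker run -d --name taskmanager --network flink-net \
    --env FLINK_PROPERTIES="jobmanager.rpc.address: jobmanager" \
    flink:1.16-scala_2.12 taskmanager

The web dashboard is then reachable at http://localhost:8081.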
I recently came across a bug in Flink and reported it (https://issues.apache.org/jira/browse/FLINK-8685), then found out that it had already been reported and a pull request created (https://github.com/apache/flink/pull/5174).
Now I clone 1.5-SNAPSHOT, apply the patch, and build Flink. Even though it builds (whether or not the patch is applied), when I run Flink (using start-cluster.sh) the web dashboard doesn't work, and the command
tail log/flink-*-jobmanager-*.log
returns "tail: cannot open 'log/flink-*-jobmanager-*.log' for reading: No such file or directory". I tested with a batch program and, surprisingly, it returned results on the terminal, but streaming programs and other things still don't work.
Any suggestions on this issue?
Thank you.
In case the Flink dashboard does not start, change the port in the conf file and restart; Flink's default port could be occupied by another process on Windows.
Also change Flink's log level to DEBUG.
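For example (the exact property names vary between Flink versions, so treat this as a sketch):

# move the dashboard off the default port 8081
echo "rest.port: 8082" >> conf/flink-conf.yaml
# raise the root log level to DEBUG (log4j 2 syntax; older versions use log4j.rootLogger)
sed -i 's/rootLogger.level = INFO/rootLogger.level = DEBUG/' conf/log4j.properties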
The build definition is created, but in order to automate the build process I need to start the build via the command line.
How is this done? Reading the doc on the scm command-line client, this doesn't seem to be described:
http://pic.dhe.ibm.com/infocenter/rtc/v1r0m0/index.jsp?topic=%2Fcom.ibm.team.scm.doc%2Ftopics%2Fc_scm_cli.html
I don't think scm is involved at all for launching a build.
You can check out the Java API: see "Automated Build Output Management Using the Plain Java Client Libraries".
Or you can use the JB Toolkit and a task like requestTeamBuild:
The requestTeamBuild task requests a build by using a specified build definition.
There must be an active engine that supports the build definition in order for the request to succeed.
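A sketch of what that can look like from the command line; the taskdef classname, toolkit path, repository URL, and build definition id below are all assumptions you would adapt to your installation:

# write a minimal Ant script that declares and calls the task
cat > requestBuild.xml <<'EOF'
<project name="request" default="request">
  <taskdef name="requestTeamBuild"
           classname="com.ibm.team.build.ant.task.RequestBuildTask"/>
  <target name="request">
    <requestTeamBuild buildDefinitionId="my.build.definition"
                      repositoryAddress="https://rtc.example.com/ccm"
                      userId="builder"
                      passwordFile="password.txt"/>
  </target>
</project>
EOF
# run it with the Build System Toolkit jars on Ant's library path
ant -lib /opt/rtc-buildsystem/buildtoolkit -f requestBuild.xml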