How to collect the result from a worker node and print it in IntelliJ? - apache-flink

My code is here
I use this code in IntelliJ; my steps are:
① mvn clean
② mvn package
③ run
This code connects to a remote cluster from IntelliJ.
The print() call saves the results on a random taskmanager on a random node in the cluster,
so I have to look for the results in $FLINK_HOME/log/*.out.
Is there a way to collect these results and print them in IntelliJ's console window?
Thanks for your help.

If you run the job within IntelliJ, using a local stream execution environment, e.g., via
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
rather than on a remote cluster, print() will show its results in the console. But with a remote stream execution environment, the results will end up in the task managers' file systems, as you have noted.
I don't believe there is a convenient way to collect these results. Flink is designed around scalability, and thus the parallel sinks are designed to avoid any sort of bottleneck. Anything that will unify all of these output streams is a hindrance to scalability.
But what you could do, if you want to have all of the results show up in one place, would be to reduce the parallelism of the PrintSink to 1. This won't bring the results into IntelliJ, but it will mean that you'll find all of the output in one file, on one task manager. You would do that via
.print()
.setParallelism(1)
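To make this concrete, a minimal sketch of the local case might look like the following; the sample elements and job name are made up for illustration:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PrintExample {
    public static void main(String[] args) throws Exception {
        // Local environment: when run inside IntelliJ, print() output
        // shows up directly in the IDE's console.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
                // With the sink's parallelism set to 1, all output ends up
                // in one place (a single *.out file when run on a cluster).
                .print()
                .setParallelism(1);

        env.execute("print example");
    }
}

On a remote cluster the output still lands in a task manager's *.out file, but with the sink parallelism at 1 it will all be in one file on one task manager.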

Related

How to convince Flink to rename .inprogress files to part-xxx

We have unit tests for a streaming workflow (using Flink 1.14.4) with bounded sources, writing Parquet files. Because it's bounded, checkpointing is automatically disabled (as per the INFO msg "Disabled Checkpointing. Checkpointing is not supported and not needed when executing jobs in BATCH mode."), which means setting ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true has no effect.
Is the only solution to run the harness with unbounded sources in a separate thread, and force it to terminate when no more data is written to the output? Seems awkward...
The solution, for others, is:
Make sure you're using FileSink, not the older StreamingFileSink.
Set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true (a rough sketch of the combination follows below).
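As a sketch only (the class and method names here are made up, and the source wiring is omitted), combining those two points on Flink 1.14 could look roughly like this:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedParquetHarnessSketch {

    // Build an environment that still runs a final checkpoint once the
    // bounded sources have finished.
    static StreamExecutionEnvironment createEnv() {
        Configuration conf = new Configuration();
        conf.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(1000);
        return env;
    }

    // Attach the newer FileSink (not the legacy StreamingFileSink).
    static void writeParquet(DataStream<GenericRecord> records, Schema schema, String outputDir) {
        FileSink<GenericRecord> sink = FileSink
                .forBulkFormat(new Path(outputDir), ParquetAvroWriters.forGenericRecord(schema))
                .build();
        records.sinkTo(sink);
    }
}

The idea is that FileSink commits pending part files on checkpoints, so allowing a final checkpoint after the bounded tasks finish is what lets the .inprogress files be renamed to part-xxx.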

Apache Beam - Flink runner - FileIO.write - issues in S3 writes

I am currently working on a Beam pipeline (2.23, Flink runner 1.8) where we read JSON
events from Kafka and write the output in Parquet format to S3.
We write to S3 every 10 minutes.
We have observed that our pipeline sometimes stops writing to S3 after we make minor, non-breaking code changes and redeploy the pipeline; if we change the Kafka
offset and restart the pipeline, it starts writing to S3 again.
While FileIO is not writing to S3, the pipeline runs fine without any error/exception and
processes records up to the FileIO stage. It gives no errors/exceptions in the logs
but silently fails to process anything at the FileIO stage.
The watermark also does not progress for that stage; it shows the watermark from the time the pipeline was stopped for the deploy (savepoint time).
We have checked our windowing function by logging records after windowing;
windowing works fine.
Also, if we replace FileIO with Kafka as the output, the pipeline runs fine and keeps outputting records to Kafka after deploys.
This is our code snippet -
parquetRecord
    .apply("Batch Events", Window.<GenericRecord>into(
            FixedWindows.of(Duration.standardMinutes(Integer.parseInt(windowTime))))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.ZERO, Window.ClosingBehavior.FIRE_ALWAYS)
        .discardingFiredPanes())
    .apply(Distinct.create())
    .apply(FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(getOutput_schema()))
        .to(outputPath.isEmpty() ? outputPath() : outputPath)
        .withNumShards(1)
        .withNaming(new CustomFileNaming("snappy.parquet")));
Flink UI screenshot: it shows records arriving up to FileIO.Write. This is the stage that is not sending any records out:
FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards ->
FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles/ParMultiDo(WriteShardsIntoTempFiles) ->
FileIO.Write/WriteFiles/GatherTempFileResults/Add void key/AddKeys/Map/ParMultiDo(Anonymous)
Any idea what could be wrong here or any open bugs in Beam/Flink?
It seems that no output is coming from this GroupByKey: https://github.com/apache/beam/blob/050b642b49f71a71434480c29272a64314992ee7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L674
This is because by default the output is re-windowed into the global window and the trigger is set to the default trigger.
You will need to add .withWindowedWrites to your FileIO configuration.
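Applied to the snippet in the question, that change would look roughly like this (only the FileIO part is shown; the windowing above it stays as it was):

.apply(FileIO.<GenericRecord>write()
    .via(ParquetIO.sink(getOutput_schema()))
    .to(outputPath.isEmpty() ? outputPath() : outputPath)
    // Keep the upstream fixed windows instead of letting the write
    // step fall back to the global window with the default trigger.
    .withWindowedWrites()
    .withNumShards(1)
    .withNaming(new CustomFileNaming("snappy.parquet")));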
Have you tried increasing .withNumShards(1)? We had a batch use case that was failing with shards set to 1, also writing to S3 from the FlinkRunner. We think it is a bug in the FlinkRunner.

Hazelcast much slower than Ignite on single-node map

I was running a simple benchmark on Hazelcast (using JMH), comparing it with Apache Ignite.
This is for a single node deployment.
The cache configuration is left at default,
final Config config = new Config();
return Hazelcast.newHazelcastInstance(config);
And I use put and get with the map,
private IMap<Long, Customer> normalCache = hazelcast.getMap(CacheName.NORMAL.getCacheName());

public void saveToCache(Customer customer) {
    normalCache.put(customer.getId(), customer);
}
From the results, it seems that Ignite is 3-4X faster than Hazelcast.
I had figured the difference would be much smaller.
For both Ignite and Hazelcast, I haven't used any other optimizations (near caches, etc.); I just went with the default configuration (results are in ops/sec, throughput).
Is this the expected performance difference, or are the results wrong?
Please run in a client server setup, or with multiple nodes.
AFAIK, in the case of Ignite, a local call is executed on the calling thread instead of being offloaded to a partition thread.
That's nice for benchmarks, but not very useful for production environments, because most calls will not be local (in a client/server setup, no call is local).
You're using a distributed system for a single-node deployment??? Really, don't do that; it's like using a Porsche to drive over a muddy field. If you want a single node, try something else like EhCache; I'm guessing you don't need high availability, backups, etc.
If you do want to choose between Ignite and Hazelcast, run a real distributed test: start 3 nodes using client/server and then see which one is the fastest (a rough sketch of that setup follows below).
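Purely as an illustration (none of this is from the original benchmark; the map name is a placeholder and Hazelcast 4/5 package names are assumed), a client/server measurement could be set up roughly like this:

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class ClientServerSketch {
    public static void main(String[] args) {
        // Server side: start a member. Run several of these (ideally on
        // separate machines) to benchmark a real cluster rather than local calls.
        HazelcastInstance member = Hazelcast.newHazelcastInstance(new Config());

        // Client side: connect to the cluster and drive put/get from here,
        // so every call goes over the network as it would in production.
        ClientConfig clientConfig = new ClientConfig();
        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
        IMap<Long, String> map = client.getMap("benchmark-map");
        map.put(1L, "value");
        System.out.println(map.get(1L));

        client.shutdown();
        member.shutdown();
    }
}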

Do I have to submit jobs to Spark or can I run them from a client lib?

So I'm learning about Spark and I have a question about how client libs work.
My goal is to do some sort of data analysis in Spark, telling it where the data sources are (databases, CSVs, etc.) to process, and store the results in HDFS, S3, or some kind of database like MariaDB or MongoDB.
I thought about having a service (API application) that "tells" Spark what I want to do. The question is: is it enough to set the master configuration to spark://remote-host:7077 at context creation, or should I send the application to Spark with some sort of spark-submit command?
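For reference, the "set the master at context creation" part described above would look roughly like this in Java; the host name, app name, and input path are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RemoteMasterSketch {
    public static void main(String[] args) {
        // Point the session at a standalone master when the context is created.
        SparkSession spark = SparkSession.builder()
                .appName("analysis-from-client")     // placeholder name
                .master("spark://remote-host:7077")  // placeholder host
                .getOrCreate();

        // Read a source and do something with it; the path is made up.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("hdfs:///data/input.csv");
        df.show();

        spark.stop();
    }
}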
This completely depends on how your environment is set up. If all paths are linked to your account, you should be able to run one of the two commands below to open a shell and run test commands. The point of having a shell is that it lets you run commands dynamically, validate/learn how to chain commands onto one another, and see what results come out.
Scala
spark-shell
Python
pyspark
Inside the environment, if everything is linked to Hive tables, you can check the tables by running
spark.sql("show tables").show(100,false)
The above command runs "show tables" against the Spark/Hive metastore catalogue and returns all active tables you can see (which doesn't mean you can access the underlying data). The 100 means I am going to look at 100 rows, and the false means to show the full string rather than only the first N characters.
In a hypothetical example, if one of the tables you see is called Input_Table, you can bring it into the environment with the commands below:
val inputDF = spark.sql("select * from Input_Table")
inputDF.count
I would strongly advise, while you're learning, not to run the commands via spark-submit, because you will need to pass in the class and JAR, forcing you to edit/rebuild for each test, which makes it difficult to figure out how commands will run without a lot of downtime.

Why SortPartition command does not stop running on the remote cluster?

In my program, I have a simple sortPartition command as shown in the following snippet. It works fine on a local cluster.
SortedData = myData.sortPartition(19, Order.ASCENDING).setParallelism(1);
I submitted the program to a remote cluster, but there is a problem and execution does not finish. It seems that the job keeps running and the command never ends.
My dataset contains only 300k records, about 50 MB. If I reduce the number of records in the dataset to 50k, the program works normally on the remote cluster. Obviously, memory cannot be the issue here.
I would like to know what causes such a problem and whether there is a solution for it.
