What's the difference between the orc and orc-nohive formats in Flink when sinking data - apache-flink

I have to sink data to S3 using Flink and the ORC format, and I found two libraries that do this. Which one should be used?
As per the docs, they show the orc library, but in my project they are already using orc-nohive.
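For context, the documented flink-orc approach is to implement a Vectorizer for your record type and hand it to an OrcBulkWriterFactory. Below is a minimal sketch assuming a hypothetical Person POJO with getName()/getAge(); as far as I can tell, the main practical difference with flink-orc-nohive is that the column vector classes come from the shaded org.apache.orc.storage packages rather than the Hive ones.

import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

import org.apache.flink.orc.vector.Vectorizer;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

// Hypothetical Vectorizer for a Person(name, age) POJO, following the flink-orc docs.
public class PersonVectorizer extends Vectorizer<Person> implements Serializable {

    public PersonVectorizer(String schema) {
        super(schema); // e.g. "struct<name:string,age:int>"
    }

    @Override
    public void vectorize(Person element, VectorizedRowBatch batch) throws IOException {
        BytesColumnVector nameCol = (BytesColumnVector) batch.cols[0];
        LongColumnVector ageCol = (LongColumnVector) batch.cols[1];
        int row = batch.size++;
        nameCol.setVal(row, element.getName().getBytes(StandardCharsets.UTF_8));
        ageCol.vector[row] = element.getAge();
    }
}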

Related

Flink Table API writes empty Parquet files

I try to write Parquet files with the Flink Table API, but the files that are written are empty. I use StreamExecutionTable and write the files to my local filesystem. When I write JSON, the files are written correctly.
I have tried enabling checkpointing and adding rolling-policy properties, but nothing helped.
The version is 1.14.
Thanks :)
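For comparison, here is a minimal sketch of what I'd expect to work (the table, column and path names are made up, not the asker's job). Parquet is a bulk format, so part files are only finalized when a checkpoint completes on the underlying StreamExecutionEnvironment.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ParquetSinkSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Bulk formats such as Parquet only commit part files on checkpoints.
        env.enableCheckpointing(60_000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        tEnv.executeSql(
            "CREATE TABLE parquet_sink (id BIGINT, name STRING) WITH ("
                + " 'connector' = 'filesystem',"
                + " 'path' = 'file:///tmp/parquet-out',"
                + " 'format' = 'parquet')");

        // 'my_source' is a placeholder for whatever table the job actually reads from.
        tEnv.executeSql("INSERT INTO parquet_sink SELECT id, name FROM my_source");
    }
}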

Flink Streaming File Sink Producing And Uploading Different Versions Of The Same Part File

I'm experiencing some odd behaviour when writing ORC files to S3 using Flink's Streaming File Sink.
StreamingFileSink<ArmadaRow> orderBookSink = StreamingFileSink
        .forBulkFormat(
                new Path(PARAMETER_TOOL_CONFIG.get("order.book.sink")),
                new OrcBulkWriterFactory<>(new OrderBookRowVectorizer(_SCHEMA), writerProperties, new Configuration()))
        .withBucketAssigner(new OrderBookBucketingAssigner())
        .withRollingPolicy(OnCheckpointRollingPolicy.build())
        .build();
I noticed when running queries during ingest of the data that my row counts were being decremented as the job progressed. I've had a look at S3 and I can see multiple versions of the same part file. The example below shows that part file 15-7 has two versions. The first file is 20.7 MB and the last file that's committed is smaller, at 5.1 MB. In most cases the current file is larger, but in my instance there are a few examples in the screenshot below where this is not the case.
I noticed from the logs on the TaskManager that Flink committed both these files at pretty much the same time. I'm not sure if this is a known issue with Flink's streaming file sink or potentially some misconfiguration on my part. I'm using Flink 1.11, by the way.
2022-02-28T20:44:03.224+0000 INFO APP=${sys:AppID} COMP=${sys:CompID} APPNAME=${sys:AppName} S3Committer:64 - Committing staging/marketdata/t_stg_globex_order_book_test2/cyc_dt=2021-11-15/inst_exch_mrkt_id=XNYM/inst_ast_sub_clas=Energy/part-15-7 with MPU ID
2022-02-28T20:44:03.526+0000 INFO APP=${sys:AppID} COMP=${sys:CompID} APPNAME=${sys:AppName} S3Committer:64 - Committing staging/marketdata/t_stg_globex_order_book_test2/cyc_dt=2021-11-15/inst_exch_mrkt_id=XNYM/inst_ast_sub_clas=Energy/part-15-7 with MPU ID
Edit
I've also tested this with version 1.13 and the same problem occurs.
Upgrading to Flink 1.13 and using the new FileSink API resolved this error:
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/file_sink/#bulk-encoded-formats
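For what it's worth, here is a rough sketch of the same sink rewritten against the newer unified FileSink, reusing the vectorizer, schema and config names from the snippet above; with FileSink the stream is attached via sinkTo(...) rather than addSink(...).

import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.writer.OrcBulkWriterFactory;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;
import org.apache.hadoop.conf.Configuration;

// Same bulk ORC writer as before, built on the unified FileSink API.
FileSink<ArmadaRow> orderBookSink = FileSink
        .forBulkFormat(
                new Path(PARAMETER_TOOL_CONFIG.get("order.book.sink")),
                new OrcBulkWriterFactory<>(new OrderBookRowVectorizer(_SCHEMA), writerProperties, new Configuration()))
        .withBucketAssigner(new OrderBookBucketingAssigner())
        .withRollingPolicy(OnCheckpointRollingPolicy.build())
        .build();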

How to follow an updating local file while using Flink

As mentioned in the document:
For example a data pipeline might monitor a file system directory for new files and write their data into an event log. Another application might materialize an event stream to a database or incrementally build and refine a search index.
So, how can I follow updates to a local file system file while using Flink?
Here, the document also mentioned that:
File system sources for streaming is still under development. In the future, the community will add support for common streaming use cases, i.e., partition and directory monitoring.
Does this mean I could use the API to do some special streaming? If you know how to use a streaming file system source, please tell me. Thanks!
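One thing that does exist today in the DataStream API is readFile with FileProcessingMode.PROCESS_CONTINUOUSLY, which periodically re-scans a path and re-reads files that have changed. A minimal sketch (the path and scan interval are made up), with the caveat that a modified file is reprocessed in full, not just its appended tail:

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class FollowLocalFile {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical directory (or single file) to watch.
        String path = "file:///tmp/watched-dir";
        TextInputFormat format = new TextInputFormat(new Path(path));

        // Re-scan the path every 10 seconds; files whose modification time changed are re-read.
        DataStream<String> lines = env.readFile(
                format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 10_000L);

        lines.print();
        env.execute("follow-local-file");
    }
}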

Serialize protobuf data to Avro - Apache Flink

Is it possible to serialize protobuf data to Avro and write it to files using an Apache Flink sink?
There is currently no out-of-the-box solution for protobuf. It's high on the priority list.
You can use protobuf-over-kryo or parse/serialize manually in the meantime.
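If you go the manual route, one way is to convert each protobuf message into an Avro GenericRecord yourself and feed the result to a bulk Avro writer. A sketch under stated assumptions: the SensorReadingProto message, its getters, the schema and the output path are all made up for illustration.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class ProtoToAvroSketch {

    // Hypothetical Avro schema matching the hypothetical SensorReadingProto message.
    static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"SensorReading\",\"fields\":["
                + "{\"name\":\"sensorId\",\"type\":\"string\"},"
                + "{\"name\":\"value\",\"type\":\"double\"}]}");

    // Manual protobuf -> Avro conversion; the accessors are assumptions, not real generated code.
    static GenericRecord toAvro(SensorReadingProto proto) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("sensorId", proto.getSensorId());
        record.put("value", proto.getValue());
        return record;
    }

    // Bulk Avro sink; apply toAvro(...) in a map() before handing the stream to this sink.
    static StreamingFileSink<GenericRecord> buildSink(String outputPath) {
        return StreamingFileSink
                .forBulkFormat(new Path(outputPath), AvroWriters.forGenericRecord(SCHEMA))
                .build();
    }
}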

Apache Flink: How can I add a header to a CSV file using the writeAsFormattedText sink?

I am using the Apache Flink framework DataSet API to extract data from various sources, transform it, and load it to various sinks.
During this period, I am working on some scenarios where datasets of POJOs are loaded to CSV files. For this purpose, I am using the writeAsFormattedText method of the Flink framework and the respective text formatters, in order to avoid mapping our data to tuples (and using writeAsCsv). One of my requirements is to add a header row on top of the CSV file produced. For example,
id,name,age
1,john,30
2,helen,42
How can such a requirement be covered using the DataSet API of Apache Flink?
Thanks in advance
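One workaround I'd try (a sketch only, not a built-in feature): subclass TextOutputFormat so a header line is written when the output file is opened, and run the sink with parallelism 1 so the header appears exactly once. The POJO fields and paths below are placeholders.

import java.io.IOException;

import org.apache.flink.api.java.io.TextOutputFormat;
import org.apache.flink.core.fs.Path;

// Writes a header line as soon as the output file is opened, then behaves like a
// normal TextOutputFormat. Only safe with parallelism 1, otherwise every parallel
// part file gets its own header.
public class HeaderTextOutputFormat extends TextOutputFormat<String> {

    private final String header;

    public HeaderTextOutputFormat(Path outputPath, String header) {
        super(outputPath);
        this.header = header;
    }

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        super.open(taskNumber, numTasks);
        writeRecord(header);
    }
}

Hypothetical usage, formatting each POJO as a CSV line before writing:

persons.map(p -> p.getId() + "," + p.getName() + "," + p.getAge())
       .output(new HeaderTextOutputFormat(new Path("file:///tmp/persons.csv"), "id,name,age"))
       .setParallelism(1);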
