I'm trying to write Parquet files with the Flink Table API, but the files that are written are empty. I use StreamExecutionTable and write the files to my local filesystem. When I write JSON, the files are written correctly.
I have tried enabling checkpointing and adding rolling-policy properties but nothing helped.
The version is 1.14
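For reference, this is roughly what I have, with checkpointing enabled and the rolling-policy options added (the table name, schema, and path are placeholders):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ParquetSinkJob {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Parquet is a bulk format, so part files only become visible after a checkpoint.
        env.enableCheckpointing(10_000L);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Placeholder sink table; the real schema and path differ.
        tEnv.executeSql(
            "CREATE TABLE parquet_sink (" +
            "  id BIGINT," +
            "  payload STRING" +
            ") WITH (" +
            "  'connector' = 'filesystem'," +
            "  'path' = 'file:///tmp/parquet-out'," +
            "  'format' = 'parquet'," +
            "  'sink.rolling-policy.rollover-interval' = '1 min'," +
            "  'sink.rolling-policy.check-interval' = '1 min'" +
            ")");

        // The actual INSERT INTO parquet_sink SELECT ... statement goes here.
    }
}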
Thanks :)
I'm experiencing some odd behaviour when writing ORC files to S3 using Flink's StreamingFileSink.
StreamingFileSink<ArmadaRow> orderBookSink = StreamingFileSink
        .forBulkFormat(new Path(PARAMETER_TOOL_CONFIG.get("order.book.sink")),
                new OrcBulkWriterFactory<>(new OrderBookRowVectorizer(_SCHEMA), writerProperties, new Configuration()))
        .withBucketAssigner(new OrderBookBucketingAssigner())   // custom bucket assigner that builds the partition path
        .withRollingPolicy(OnCheckpointRollingPolicy.build())   // part files are rolled only on checkpoints
        .build();
I noticed when running queries during ingest of the data that my row counts were being decremented as the job progressed. I've had a look at S3 and I can see multiple versions of the same part file. The example below shows that part file 15-7 has two versions. The first file is 20.7 MB and the last file that's committed is smaller at 5.1 MB. In most cases the current file is larger, but in my instance there are a few examples in the screenshot below where this is not the case.
I noticed from the logs on the TaskManager that Flink committed both of these files at pretty much the same time. I'm not sure if this is a known issue with Flink's StreamingFileSink or potentially some misconfiguration on my part. I'm using Flink 1.11, by the way.
2022-02-28T20:44:03.224+0000 INFO APP=${sys:AppID} COMP=${sys:CompID} APPNAME=${sys:AppName} S3Committer:64 - Committing staging/marketdata/t_stg_globex_order_book_test2/cyc_dt=2021-11-15/inst_exch_mrkt_id=XNYM/inst_ast_sub_clas=Energy/part-15-7 with MPU ID
2022-02-28T20:44:03.526+0000 INFO APP=${sys:AppID} COMP=${sys:CompID} APPNAME=${sys:AppName} S3Committer:64 - Committing staging/marketdata/t_stg_globex_order_book_test2/cyc_dt=2021-11-15/inst_exch_mrkt_id=XNYM/inst_ast_sub_clas=Energy/part-15-7 with MPU ID
Edit
I've also tested this with version 1.13 and the same problem occurs.
Upgrading to Flink 1.13 and using the new FileSink API resolved this error:
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/file_sink/#bulk-encoded-formats
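A rough sketch of the migrated sink, mirroring the original snippet (OrderBookRowVectorizer, _SCHEMA, writerProperties, PARAMETER_TOOL_CONFIG, and the bucket assigner are the poster's own classes; only the builder entry point changes):

FileSink<ArmadaRow> orderBookSink = FileSink
        .forBulkFormat(new Path(PARAMETER_TOOL_CONFIG.get("order.book.sink")),
                new OrcBulkWriterFactory<>(new OrderBookRowVectorizer(_SCHEMA), writerProperties, new Configuration()))
        .withBucketAssigner(new OrderBookBucketingAssigner())
        .withRollingPolicy(OnCheckpointRollingPolicy.build())
        .build();

// FileSink is attached with sinkTo(...) rather than addSink(...):
// stream.sinkTo(orderBookSink);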
As mentioned in the documentation:
For example a data pipeline might monitor a file system directory for new files and write their data into an event log. Another application might materialize an event stream to a database or incrementally build and refine a search index.
So, how can I follow updates to a file on a local file system while using Flink?
The documentation also mentions:
File system sources for streaming is still under development. In the future, the community will add support for common streaming use cases, i.e., partition and directory monitoring.
Does this mean I could already use the API for this kind of streaming? If you know how to use a streaming file system source, please tell me. Thanks!
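One source that seems to cover directory monitoring today is the readFile source with PROCESS_CONTINUOUSLY; a minimal sketch, assuming a hypothetical local directory:

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class WatchDirectoryJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String dir = "file:///tmp/watched";  // hypothetical directory to monitor
        TextInputFormat format = new TextInputFormat(new Path(dir));

        // Re-scan the directory every 10 seconds; note that a modified file is re-read in full.
        DataStream<String> lines =
                env.readFile(format, dir, FileProcessingMode.PROCESS_CONTINUOUSLY, 10_000L);

        lines.print();
        env.execute("watch-directory");
    }
}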
I need to configure file names for files created by StreamingFileSink.
I use ParquetAvroWriters.forGenericRecord to create Parquet files.
I discovered that I can't use .withOutputFileConfig() when I use .forBulkFormat() (it is available when using .forRowFormat()).
It started working with Flink 1.11
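A minimal sketch of the combination that works from 1.11 on (the path, prefix/suffix, and schema variable are placeholders):

OutputFileConfig fileConfig = OutputFileConfig.builder()
        .withPartPrefix("orders")        // hypothetical part-file prefix
        .withPartSuffix(".parquet")
        .build();

StreamingFileSink<GenericRecord> sink = StreamingFileSink
        .forBulkFormat(new Path("file:///tmp/out"),             // hypothetical output path
                ParquetAvroWriters.forGenericRecord(schema))     // schema: the Avro schema in use
        .withOutputFileConfig(fileConfig)
        .withRollingPolicy(OnCheckpointRollingPolicy.build())
        .build();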
I want to upload lots of source files (say, Java) to Solr to allow indexed search on them.
They should be posted as plain text files.
No special parsing is required.
When trying to upload one Java file, I get an "Unknown Source" related error:
java.lang.NoClassDefFoundError: com/uwyn/jhighlight/renderer/XhtmlRendererFactory
When I rename the file adding .txt in the end, it is uploaded successfully.
I have thousands of files to upload on a daily basis and need to keep original names.
How do I tell Solr to treat all files in the directory as .txt?
Thanks in advance!
For googlers, concerning the Solr error:
java.lang.NoClassDefFoundError: com/uwyn/jhighlight/renderer/XhtmlRendererFactory
You can correct this by adding the jar "jhighlight-1.0.jar" in Solr. To do so:
Download the old Solr 4.9 (in recent versions, jhighlight is not included).
Extract solr-4.9.0\contrib\extraction\lib\jhighlight-1.0.jar
Copy jhighlight-1.0.jar to the solr installation under solr/server/lib/ext/
Restart the server.
You can achieve the same by integrating Solr with Tika.
Apache Tika will help you extract the text of the source files.
It has a source code parser which supports C, C++, and Java.
Here is a link which will give you more details:
https://tika.apache.org/1.12/formats.html
I want to send multiple files to Solr using curl. How can I do it?
I can do it with only one file, with a command like this:
curl "http://localhost:8983/solr/update/extract?literal.id=paas2&commit=true" -F "file=@cloud.pdf"
Can anyone help me? Thanks!
The API does not support passing multiple files for extraction.
Usually the last file will be the only one that gets uploaded and added.
You can have individual files indexed as separate entities in Solr.
Or, one way to upload multiple files is to zip them and upload the zip file.
There is one issue with Solr indexing zip files, and you can try the SOLR-2332 patch.
I am using Apache Solr 4.0 Beta, which has the capability to upload multiple files and generate an id for each uploaded file using post.jar, and it's very helpful for me.
See:
http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29
Thanks all :)
My problem has been solved :)