I am writing a batch job with Apache Flink using the DataSet API. I can read a text file using readTextFile(), but this function only reads one file at a time.
I would like to consume all the text files in my directory one by one and process them in the same function, as a single batch job with the DataSet API, if that is possible.
The other option is to implement a loop running multiple jobs, one for each file, instead of one job with multiple files, but I don't think that solution is the best.
Any suggestions?
If I read the documentation right, you can read an entire path (a directory) using ExecutionEnvironment.readTextFile(). You can find an example here: Word-Count-Batch-Example
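As a rough sketch (the directory path here is made up): pointing readTextFile() at a directory makes Flink read every file in it into a single DataSet, and recursive enumeration of nested directories can be switched on through a configuration parameter on the source.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class ReadDirectoryJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Pointing readTextFile() at a directory reads every file in it into one DataSet
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true); // also descend into sub-directories
        DataSet<String> lines = env.readTextFile("file:///data/input-dir") // hypothetical directory
                                   .withParameters(parameters);

        // All files are then processed by the same batch job / the same functions
        lines.print();
    }
}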
References:
Flink Documentation
Flink Sources
I want to read files every day, so the path should always be set to the current date. How can I do that in Flink?
I think you can either generate the path of the file you expect to exist, and then read it, or set up a streaming job that uses file discovery to ingest files as they become available.
See the docs for FileSource and readFile. readFile is the legacy connector for ingesting files; FileSource was introduced in Flink 1.12.
fileInputFormat.setFilesFilter(filter);
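Roughly, both batch-side options could look like the sketch below (it also shows what the setFilesFilter(filter) call above is for). The paths, the date-in-filename convention, and the filter logic are assumptions for illustration, not something from the thread.

import org.apache.flink.api.common.io.FilePathFilter;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;

import java.time.LocalDate;

public class DailyIngestJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        String today = LocalDate.now().toString(); // e.g. "2024-01-31"

        // Option 1: generate today's path up front and read only that directory
        // (the hdfs:///data/incoming/<date> layout is hypothetical)
        DataSet<String> todaysLines = env.readTextFile("hdfs:///data/incoming/" + today);

        // Option 2: read a fixed directory, but let the input format skip files whose
        // names do not contain today's date (this is what setFilesFilter() is for)
        TextInputFormat format = new TextInputFormat(new Path("hdfs:///data/incoming"));
        format.setFilesFilter(new FilePathFilter() {
            @Override
            public boolean filterPath(Path filePath) {
                // returning true means "ignore this file"
                return !filePath.getName().contains(today);
            }
        });
        DataSet<String> todaysFiles = env.readFile(format, "hdfs:///data/incoming");

        todaysLines.print();
        todaysFiles.print();
    }
}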
I decided to use this, but I'm not sure it is best practice:
file:/../..?noop=true&scheduler=quartz2&scheduler.cron=0+0/10+*+*+*+?
Using noop=true lets the files stay in the same place after the route consumes them, but it also enables idempotent consumption, which I don't want. (There is a second route that will do the deletion based on some other logic, so consuming non-idempotently shouldn't lead the first route into an infinite loop, I believe.)
I think I could overwrite the file and use idempotentKey as ${file:name}-${file:modified} so that the file is picked up on the next poll, but that still means an extra write. Deleting and recreating the same file should also work, but again, it is not a clean approach.
Is there a better way to accomplish this? I could not find one in the Camel documentation.
Edit: To summarize, I want to read the same files over and over on a schedule (say every 10 minutes) from the same repo. SOLVED! - Answer below.
Camel Version: 2.14.1
Thanks!
SOLVED! Adding idempotent=false to the endpoint did the trick:
file:/../..?noop=true&idempotent=false&scheduler=quartz2&scheduler.cron=0+0/10+*+*+*+?
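For completeness, a minimal standalone route sketch using that endpoint. The input directory and the log step are placeholders, and scheduler=quartz2 assumes camel-quartz2 is on the classpath.

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.main.Main;

public class RereadFilesRoute extends RouteBuilder {

    @Override
    public void configure() {
        // noop=true leaves the files where they are after consumption;
        // idempotent=false lets the same files be consumed again on every
        // cron-triggered poll (here: every 10 minutes via quartz2)
        from("file:/path/to/repo"                                   // placeholder directory
                + "?noop=true&idempotent=false"
                + "&scheduler=quartz2&scheduler.cron=0+0/10+*+*+*+?")
            .log("Re-read file ${file:name}");                      // placeholder processing step
    }

    public static void main(String[] args) throws Exception {
        Main main = new Main();
        main.addRouteBuilder(new RereadFilesRoute());
        main.run();
    }
}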
I am trying to process some files on HDFS and write the results back to HDFS too. The files are already prepared before the job starts. The thing is, I want to write to different paths and files based on the file content. I am aware that BucketingSink (doc here) is provided to achieve this in Flink streaming. However, it seems that DataSet does not have a similar API. I have found some Q&As on Stack Overflow (1, 2, 3). Now I think I have two options:
Use the Hadoop API: MultipleTextOutputFormat or MultipleOutputs;
Read the files as a stream and use BucketingSink.
My question is how to choose between them, or is there another solution? Any help is appreciated.
EDIT: This question may be a duplicate of this.
We faced the same problem. We too are surprised that DataSet does not support addSink().
I recommend not switching to streaming mode; you might give up some optimizations (e.g., memory pools) that are available in batch mode.
You may have to implement your own OutputFormat to do the bucketing.
Instead of a sink, you can extend OutputFormat[YOUR_RECORD] (or RichOutputFormat[YOUR_RECORD]), where you can still use a BucketAssigner[YOUR_RECORD, String] to open/write/close the output streams.
That's what we did and it's working great.
I hope Flink will support this in batch mode soon.
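Not the code from that project, just a minimal sketch of the pattern under assumed details: records are plain comma-separated strings, the first field picks the bucket, and each bucket gets one part file per parallel subtask.

import org.apache.flink.api.common.io.RichOutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Writes each record to <basePath>/<bucket>/part-<subtask>, where the bucket
// is derived from the record content (here: the text before the first comma).
public class BucketingOutputFormat extends RichOutputFormat<String> {

    private final String basePath;
    private transient Map<String, FSDataOutputStream> openStreams;
    private transient int subtask;

    public BucketingOutputFormat(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void configure(Configuration parameters) {
        // nothing to configure in this sketch
    }

    @Override
    public void open(int taskNumber, int numTasks) {
        openStreams = new HashMap<>();
        subtask = taskNumber;
    }

    @Override
    public void writeRecord(String record) throws IOException {
        String bucket = record.split(",", 2)[0];   // hypothetical bucketing rule
        FSDataOutputStream out = openStreams.get(bucket);
        if (out == null) {
            // lazily open one part file per bucket and parallel subtask
            Path target = new Path(basePath + "/" + bucket + "/part-" + subtask);
            out = target.getFileSystem().create(target, FileSystem.WriteMode.OVERWRITE);
            openStreams.put(bucket, out);
        }
        out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        for (FSDataOutputStream out : openStreams.values()) {
            out.close();
        }
    }
}

A DataSet<String> can then be written with something like results.output(new BucketingOutputFormat("hdfs:///output")).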
I have a use case of reading sequence files as a batch job with the Flink DataSet API. The files are stored in an S3 bucket that I have to consume into a Flink DataSet. I am not able to read the files by providing comma-separated file paths to the DataSet reader. I cannot read the files in a loop because there are a lot of files in the bucket, and the union function for Flink DataSets seems to fail after a few iterations. Can someone help me with creating a custom sequence file reader that works for this case, like the one provided in Spark?
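One direction worth sketching (an assumption on my part, not something from the thread): Flink's Hadoop compatibility layer can wrap Hadoop's SequenceFileInputFormat, which takes a directory and enumerates all files under it, so no per-file loop or union is needed. This assumes the flink-hadoop-compatibility dependency, a Hadoop S3 filesystem (e.g. s3a) configured with credentials, and made-up key/value types and bucket names.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

public class SequenceFilesFromS3 {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hadoop's SequenceFileInputFormat enumerates every file under the given
        // directory, so no per-file loop or union of DataSets is needed.
        // Bucket, prefix and key/value types are made up for this sketch.
        DataSet<Tuple2<Text, BytesWritable>> records = env.createInput(
                HadoopInputs.readSequenceFile(Text.class, BytesWritable.class,
                        "s3a://my-bucket/sequence-files/"));

        records.first(10).print();
    }
}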
I'm trying to process large (~50 MB) XML files to store in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even uploading the file straight into my source code, but I keep running into limits (i.e. the 32 MB limit).
So, I'm really confused (and a little angry/frustrated). Does App Engine really have no real way to process a large file? There does seem to be one potential workaround, which would involve remote_apis, Amazon (or Google Compute, I guess) and a security/setup nightmare...
HTTP ranges were another thing I considered, but it'll be painful to somehow stitch the different split parts back together (unless I can manage to split the file at exact points).
This seems crazy, so I thought I'd ask Stack Overflow... am I missing something?
Update:
I tried using range requests, and it looks like the server I'm trying to stream from doesn't support them. So right now I'm thinking of downloading the file, hosting it on another server, then using App Engine to access it via HTTP range requests on backends, AND then automating the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
What about storing it in Cloud Storage and reading it incrementally? You can access it line by line (in Python anyway), so it won't consume all your resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write files to buckets in Google Cloud Storage (GCS). This library supports reading and writing large amounts of data to GCS, with internal error handling and retries, so you don't have to write your own code to do this. Moreover, it provides read buffering with prefetch so your app can be more efficient.

The GCS client library provides the following functionality:

An open method that returns a file-like buffer on which you can invoke standard Python file operations for reading and writing.
A listbucket method for listing the contents of a GCS bucket.
A stat method for obtaining metadata about a specific file.
A delete method for deleting files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
import os
import cloudstorage as gcs  # the GCS client library bundled with the App Engine SDK

# inside a webapp2 request handler
def read_file(self, filename):
    self.response.write('Truncated file content:\n')
    gcs_file = gcs.open(filename)
    self.response.write(gcs_file.readline())  # read just the first line
    gcs_file.seek(-1024, os.SEEK_END)         # then jump to the last 1 KB
    self.response.write(gcs_file.read())
    gcs_file.close()
Incremental reading with standard Python!