Too many open files error when reading from a directory - apache-flink

I'm using readTextFile(/path/to/dir) to read batches of files, do some manipulation on the lines, and save them to Cassandra.
Everything seemed to work just fine, until I reached more than 170 files in the directory (files are deleted after a successful run).
Now I'm receiving "IOException: Too many open files", and a quick look at lsof shows thousands of file descriptors opening once I run my code.
Almost all of the file descriptors are "Sockets".
Testing on a smaller scale with only 10 files resulted in more than 4000 file descriptors opening; once the script has finished, all the file descriptors are closed and back to normal.
Is this normal behavior for Flink? Should I increase the ulimit?
Some notes:
The environment is Tomcat 7 with Java 8 and Flink 1.1.2, using the DataSet API.
The Flink job is scheduled with Quartz.
All those 170+ files sum to about 5 MB.

Problem solved.
After narrowing down the code, I found that calling "Unirest.setTimeouts" inside a highly parallel "map()" step caused too many thread allocations, which in turn consumed all my file descriptors.
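For reference, here is a minimal sketch of the kind of change that fixed it (the class name and timeout values are illustrative, not my actual job): move the Unirest configuration out of the per-record path, e.g. into a RichMapFunction's open(), so it runs once per parallel task instead of once per line.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import com.mashape.unirest.http.Unirest;

// Illustrative sketch: configure Unirest once per parallel task, not per record.
public class EnrichLine extends RichMapFunction<String, String> {

    @Override
    public void open(Configuration parameters) {
        // Calling setTimeouts here (or once at startup) avoids the per-record
        // thread allocations that exhausted the file descriptors.
        Unirest.setTimeouts(10_000, 60_000);
    }

    @Override
    public String map(String line) throws Exception {
        // ... per-line work (HTTP calls, transformation) goes here ...
        return line;
    }
}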

Related

Processing one file at a Time in talend

I have a problem with Talend. Currently I'm using the open-source version, and I have a file watcher. Even if two files arrive in the same second, I need Talend to process them one at a time: only once the first has gone through all the mappings should the other file proceed.
I can't use tParallelize because it's not available in this version.

Can create a file but not write

I have an application that writes to a log file and also creates other files as it runs. What I am seeing is that, in some rare cases, the application runs fine (creating files and appending to the log file) and then suddenly it can still create and delete files, but it can no longer write to them. The log file stops in the middle of a line, and other files that are created are 0 bytes: they can be created, but nothing can be written to them.
Rebooting the machine helps, and we can create and write files with no issue after the reboot. All affected machines run either RHEL 6 or CentOS 6. This is not a drive-space issue; there is plenty of room left on the machines that show the problem.
Does anyone have any clue what might cause this behavior?

Does Windows ftp process commands at the same time or in sequence?

I'm having trouble finding the answer to this question; maybe I'm just not asking it properly. I have to put a relatively large file (~500 MB at least) on an FTP server and then run a process that takes it as a parameter. My question is as follows: if I'm using ftp.exe to do this, does the put command block until the file has finished being copied?
I was planning on using a .bat file to execute the commands needed, but I don't know whether the file will be completely copied before the other process starts reading it.
Edit: for clarity's sake, here is a sample of the .bat I would be executing.
ftp -s:commands.txt ftpserver
and the contents of the commands.txt would be
user
password
put fileName newFileName
quote cmd_to_execute
quit
The Windows ftp.exe (as probably all similar scriptable clients) executes the commands one-by-one.
No parallel processing takes place.
The FTP protocol doesn't specify placing a lock on a file before writing to it. However, this doesn't prevent a server from implementing such a feature, as it would be a valuable addition.
Some file systems (e.g. NTFS) may provide a locking mechanism to prevent concurrent access. See File locking - Wikipedia.
See this thread as a reference: How do filesystems handle concurrent read/write?
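If you'd rather drive this from code than a .bat, here is a minimal sketch of the same sequential flow using Apache Commons Net (an alternative to ftp.exe, not something the answer above requires): storeFile() blocks until the upload has completed, so the raw command is only sent once the whole file is on the server. The host, credentials, file names and cmd_to_execute are the placeholders from the question.

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

// Illustrative alternative to the .bat: upload the file, then trigger the command.
public class UploadThenRun {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftpserver");
        ftp.login("user", "password");
        ftp.setFileType(FTP.BINARY_FILE_TYPE);

        try (InputStream in = new FileInputStream("fileName")) {
            // storeFile blocks until the whole file has been transferred
            ftp.storeFile("newFileName", in);
        }

        // equivalent of "quote cmd_to_execute": only sent after the upload returns
        ftp.sendCommand("cmd_to_execute");
        ftp.logout();
        ftp.disconnect();
    }
}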

Data loss on concurrent file write in camel

I am using Camel for my file operations. My system is a clustered environment.
Let's say I have 4 instances:
Instance A
Instance B
Instance C
Instance D
Folder structure:
Input Folder: C:/app/input
Output Folder: C:/app/output
All four instances point to the input folder. As per my business requirement, 8 files will be placed in the input folder and the output will be one consolidated file. Here Camel is losing data when writing concurrently to the output file.
Route:
from("file://C:/app/input")
.setHeader(Exchange.FILE_NAME, simple("output.txt"))
.to("file://C:/app/output?fileExist=Append")
.end();
Kindly help me resolve this issue. Is there anything like a write lock in Camel to avoid concurrent file writers? Thanks in advance.
You can use the doneFile option of the file component; see http://camel.apache.org/file2.html for more information.
Avoid reading files currently being written by another application
Beware that the JDK file I/O API is a bit limited in detecting whether another application is currently writing/copying a file, and the implementation can differ across OS platforms as well. This could lead to Camel thinking the file is not locked by another process and starting to consume it. Therefore you have to do your own investigation of what suits your environment. To help with this, Camel provides different readLock options and the doneFileName option that you can use. See also the section on consuming files from folders where others drop files directly.
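As a minimal sketch (assuming the folder layout from your question; the doneFileName pattern and readLock value are illustrative choices, not your current configuration), the consuming route could look like this: each input file is only picked up once a matching .done marker exists, and readLock=changed additionally waits until the file has stopped growing.

import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;

// Illustrative route: only consume input files that have a ".done" marker,
// and wait until a file has stopped changing before reading it.
public class ConsolidateRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file://C:/app/input?doneFileName=${file:name}.done&readLock=changed")
            .setHeader(Exchange.FILE_NAME, simple("output.txt"))
            .to("file://C:/app/output?fileExist=Append");
    }
}

Note that this mainly protects the reading side; if all four instances keep appending to the same output.txt concurrently, their writes can still interleave, so you may also want a single instance (or a single route) to do the consolidation.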

Foreach Loop Container with Foreach File Enumerator option iterates all files twice

I am using the SSIS Foreach Loop Container to iterate through files with a certain pattern on a network share.
I am encountering a kind of unreproducible malfunction of the Loop Container:
Sometimes the loop is executed twice. After all files have been processed, it starts over with the first file.
Has anyone encountered a similar bug?
Maybe not directly using SSIS but accessing files on a Windows share with some kind of technology?
Could this error relate to some network issues?
Thanks.
I found this to be the case whilst working with Excel files and using the *.xlsx wildcard to drive the foreach.
Once I put logging in place, I noticed that when the Excel file was opened it produced a temporary file prefixed with ~$. This was picked up by the foreach loop.
So I used a trick similar to http://geekswithblogs.net/Compudicted/archive/2012/01/11/the-ssis-expression-wayndashskipping-an-unwanted-file.aspx to exclude files with a ~$ in the filename.
What error message (SSIS log / Eventvwr messages) do you get?
Similar to #Siva, I've not come across this, but here are some ideas you could use to try to diagnose it. You may be doing some of these already; I've just written them down for completeness from my thought process...
Log all files processed: write a line to a log file/table before processing each file and again after processing it, keeping the full path of each file. This is something we do as standard in our ETL implementations, as users often come back to us with questions about when/what has been loaded. It will let you see whether files are actually being processed twice.
Perhaps try moving each file to a different directory after it is processed. That makes it harder for a file to be processed a second time, and the problem may disappear. (If you are processing them from a "master" area and so can't move them, consider copying the files to a "waiting" folder, then processing them and moving them to a "processed" folder.)
#Siva's comment is interesting - look at the "traverse subfolders" check box.
Check your Event Viewer for odd network events or application events (SQL Server restarting?).
Use Perfmon to see whether anything odd is happening in terms of network load on your server (a bit of a random idea!).
Try running the whole process with files on a local disk instead of a network disk. If your mean time between failures is around 10 runs, you could do the load locally 20-30 times; if you don't get an error, it may be a network issue.
Nothing helped, so I implemented the following workaround: a script task in the foreach iterator that tracks all files. If a file was already loaded, a warning is fired and the file is not processed again. Anyway, it seems to be some network-related problem...
