Read file attributes for all files in the tree using as few I/O operations as possible - nio

I have a lot of small files on an NFS drive (Amazon EFS in my case). The files are served over HTTP, much like a classical web server does. Since I need to validate the last modification time of each file, it takes at least one I/O operation per file request. That is the case even if I have already cached the file body in RAM.
Is there a way to read the last-modified attribute for all the files in the tree (or at least in a single directory) using only a single I/O operation?
There is a method Files.readAttributes which reads multiple attributes of a single file as a bulk operation. I am looking for a bulk operation that reads a single attribute of multiple files.
UPDATE: in the case of NFS, this question amounts to how to make use of the NFS command READDIRPLUS. That command does exactly what I need, but there seems to be no way to invoke it through the Java I/O library.
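For reference, this is roughly the per-file pattern I would like to avoid - a minimal sketch (the class and method names are just for illustration). The attributes arrive with the visitor, but each visited file still costs at least one metadata lookup on the NFS client:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;
import java.util.HashMap;
import java.util.Map;

class MtimeWalk {
    static Map<Path, FileTime> lastModifiedTimes(Path root) throws IOException {
        final Map<Path, FileTime> result = new HashMap<Path, FileTime>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // attrs are handed to the visitor during the walk, so there is no second
                // lookup in Java, but the OS still reads each file's metadata individually
                result.put(file, attrs.lastModifiedTime());
                return FileVisitResult.CONTINUE;
            }
        });
        return result;
    }
}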

I don't know of a standard Java class to list all the files and modified times in one operation, but if you are permitted to utilise the host environment and the NFS drive is mounted, you could adapt the following technique to suit your environment:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Run "ls -l" through a shell and capture each output line (name plus attributes).
ProcessBuilder listFiles = new ProcessBuilder("bash", "-c", "ls -l");
Process p = listFiles.start();
BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
String inputLine;
List<String> filesWithAttributes = new ArrayList<String>();
while ((inputLine = reader.readLine()) != null) {
    filesWithAttributes.add(inputLine);
}
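If GNU find is available on the host (an assumption; it usually is on Linux), a variant along these lines avoids parsing the locale-dependent ls -l output - %T@ prints each file's modification time as a Unix timestamp:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

class MtimesViaFind {
    // One process launch per directory; each output line is "<epoch mtime> <path>".
    static List<String> namesWithMtimes(String dir) throws IOException {
        Process p = new ProcessBuilder(
                "find", dir, "-maxdepth", "1", "-type", "f",
                "-printf", "%T@ %p\n").start();
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }
}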

I think this question may be a duplicate of Getting the last modified date of a file in Java. Nonetheless, I think that if you use lastModified() of the File class, you probably use the fewest I/O operations. So for this, I would use something similar to icyrock.com's answer, which would be:
new File("/path/to/file").lastModified()
Also, the answers to the question java - File lastModified vs reading the file might give you helpful information.
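A minimal sketch of applying that per directory - still one metadata call per file, but done in a single pass (the path is just an example):

import java.io.File;

File dir = new File("/mnt/efs/content");        // hypothetical mount point
File[] children = dir.listFiles();
if (children != null) {
    for (File f : children) {
        long mtime = f.lastModified();          // epoch millis, 0 if the file vanished
        // compare mtime against your cached entry here
    }
}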

Related

How to know fast if another computer is accessible in AS3 (Adobe AIR)

I'm using FileStream to read files on two computers that are connected over a Local Area Network (LAN).
I have no problem reading the files when the other computer is connected. I check whether the directory exists before writing the file,
and whether the directory AND the file exist before reading the file.
file_pc1 = new File("//OTHER-PC/Folder/file.csv");
directory_pc1 = new File("//OTHER-PC/Folder");
if (directory_pc1.exists)
{
    stream.open(file_pc1, FileMode.WRITE);
    stream.writeUTFBytes(csv.data);
    stream.close();
}
if (directory_pc1.exists && file_pc1.exists)
{
    stream.open(file_pc1, FileMode.READ);
    csv.data = stream.readUTFBytes(stream.bytesAvailable);
    stream.close();
}
All this works great, but if the other computer is not connected, the statements directory_pc1.exists and file_pc1.exists take a very long time and the app freezes, sometimes even triggering the "Application is not responding" message from Windows, though it finally responds after a long time.
Is there a faster way to check whether I'm connected to the other PC?
While I don't know of a method to make File.exists() perform faster (likely there is no way as it's more of an OS issue), you can at least mitigate the issue by using asynchronous operations instead - thus avoiding locking the UI.
You can skip the exists check and just attempt to open the file asynchronously; if it errors, then you know the file doesn't exist, and it will likely take about the same amount of time as using File.exists.
So something like this:
stream.addEventListener(IOErrorEvent.IO_ERROR, fileIOError);
stream.addEventListener(Event.COMPLETE, loadComplete);
stream.openAsync(file_pc1, FileMode.UPDATE);

function loadComplete(e:Event):void {
    stream.writeUTFBytes(csv.data);
    stream.close();
}

function fileIOError(e:IOErrorEvent):void {
    // file doesn't exist, do something
}
That all said, since you're just writing a file, it doesn't actually make sense to check whether it exists (unless you need to use the current data in that file as part of your write operation) - if you use FileMode.WRITE, it doesn't matter whether the file exists or not; it will attempt to save the file either way. You'll need to wrap it in a try/catch, though, in case the network location is unavailable.

Node JS: Import large array and then perform regex

I am building a command line tool and want to use Node JS in this particular case.
I have a TXT file on which I want to perform regex on each line and use those within another function.
1) Should I import-convert the TXT file into an ARRAY using readFileSync or readFile AND then go through the elements of this array?
2) Should I go with readLines?
This file's size might be up to 5 MB but it will get larger and larger with time (up to hundred(s)).
3) Should I use Python, Ruby or any other language for this specific purpose? Would any other language make it much better? (Please answer the first two questions as my ability of not-using-node and option for something much different might not be possible)
Ultimately I want all this data stored in memory so it can be used over and over again at different times, so I can consider any other solution as long as it is fast.
Thank you very much.
3) You should use something async, like Node.js. The benefits are that you can read a chunk of the file and process it on the spot (but without blocking your entire app while this happens and without buffering the whole file), then move to the next chunk and so on. At any time you can pause the stream if you wish.
2) I think you should read (and then process) the file line by line.
1) You should definitely choose a readStream: http://nodejs.org/docs/v0.6.18/api/fs.html#fs_class_fs_readstream
That way you won't have to wait for the whole file to be read (and kept in memory). Here's a small snippet on how to achieve this by using a readStream and carrier (https://github.com/pgte/carrier):
var fs = require('fs'),
    carrier = require('carrier'),
    file = 'test.txt',
    stream;

// Stream the file instead of buffering it all; carrier emits one callback per line.
stream = fs.createReadStream(file, { encoding: 'utf8' });

carrier.carry(stream, function(line) {
    extractWithRegex(line); // your per-line regex handler
});

Reading a file directly from HDFS into a shell function

I have a shell function that is called from inside my map function. The shell function takes 2 parameters -> an input file and an output file. Something like this
$> unix-binary /pathin/input.txt /pathout/output.txt
The problem is that these input.txt files reside in HDFS and the output.txt files need to be written back to HDFS. Currently, I first copy the needed file with fs.copyToLocalFile onto the local hard drive, call the unix binary and then write the output.txt back to HDFS with fs.copyFromLocalFile.
The problem with this approach is that it is not optimal: it involves a substantial amount of redundant reading and writing to the HDD, which slows down performance. So my question is: how can I read the HDFS file directly as input and write the results directly to HDFS?
Obviously,
$> unix-binary hdfs://master:53410/pathin/input.txt hdfs://master:54310/pathout/output.txt
will not work. Is there any way around this? Can I treat an HDFS file as a local file somehow?
I have access to the unix-binary source code written in C. Maybe changing the source code would help?
thanks
You can add the file to the DistributedCache and access it from the mapper from the cache. Call your shell function on the local file and write the output file to local disk and then copy the local file to HDFS.
However, operations such as calling shell functions, or reading/writing from within a mapper/reducer break the MapReduce paradigm. If you find yourself needing to perform such operations, MapReduce may not be the solution you're looking for. HDFS and MapReduce were designed to perform massive scale batch processing on small numbers of extremely large files.
Since you have access to the unix-binary source code, your best option might be to implement the particular function(s) you want in Java. Feed the input files to your mapper and call the function from the mapper on the data rather than working with files on HDFS/LocalFS.
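If you do port the logic to Java, a rough sketch of streaming straight from and to HDFS with the FileSystem API might look like this (the paths and the process() step are illustrative placeholders, and it assumes the Hadoop client libraries are on the classpath):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class HdfsStreamer {
    static void run() throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/pathin/input.txt"));
        FSDataOutputStream out = fs.create(new Path("/pathout/output.txt"));
        try {
            // process(in, out): the logic ported from the C binary (hypothetical)
            process(in, out);
        } finally {
            in.close();
            out.close();
        }
    }

    static void process(FSDataInputStream in, FSDataOutputStream out) throws IOException {
        // placeholder for the ported unix-binary logic
    }
}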

Check if another program has a file open

After doing tons of research and not being able to find a solution to my problem, I decided to post here on Stack Overflow.
Well my problem is kind of unusual so I guess that's why I wasn't able to find any answer:
I have a program that is recording stuff to a file. Then I have another one that is responsible for transferring that file. Finally I have a third one that gets the file and processes it.
My problem is:
The file transfer program needs to send the file while it's still being recorded. The problem is that when the transfer program reaches the end of the file, that doesn't mean the file is actually complete, since it is still being recorded.
It would be nice to have a way to check whether the recorder still has the file open or has already closed it, so I could judge whether the end of file is the real end of the file or there simply isn't any further data to read yet.
Hope you can help me out with this one. Maybe you have another idea on how to solve this problem.
Thank you in advance.
GeKod
Simply put - you can't without using filesystem notification mechanisms; Windows, Linux and OS X all have flavors of this. I forget how Windows does it off the top of my head, but Linux has 'inotify' and OS X has kqueue/FSEvents.
The easy way to handle this is: record to a tmp file, and when the recording is done, move the file into the 'ready-to-transfer' directory. If you do this so that both locations are on the same filesystem, the move is atomic and instant, ensuring that any time your transfer utility 'sees' a new file, it'll be wholly formed and ready to go.
Or just give your tmp files no extension, and when recording is done, rename the file to an extension that the transfer agent is polling for.
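For illustration (in Java; the paths are hypothetical), the hand-off is a single atomic rename as long as both directories sit on the same filesystem:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class HandOff {
    static void publish() throws IOException {
        Path tmp   = Paths.get("/data/tmp/recording-123.part");    // written by the recorder
        Path ready = Paths.get("/data/ready/recording-123.rec");   // watched by the transfer agent
        // Atomic on the same filesystem: the transfer agent never sees a half-written file.
        Files.move(tmp, ready, StandardCopyOption.ATOMIC_MOVE);
    }
}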
Have you considered using a stream interface between the recorder program and the one that grabs the recorded data/file? If you have access to a stream interface (say, an OS/stack service) that also provides a reliable end-of-stream signal/primitive, you could consider using that to replace the file interface.
There are no functions/libraries available in C to do this. But a simple alternative is to rename the file once each activity is over. For example, the recorder can open the file with the name file.record and, once done with recording, rename file.record to file.transfer. The transfer program looks for file.transfer, and once the transfer is done, it renames the file to file.read; the reader reads that and finally renames it to file.done!
You can check whether the file is open or not as follows:
FILE_NAME="filename"
FILE_OPEN=$(lsof | grep "$FILE_NAME")
if [ -z "$FILE_OPEN" ]; then
    echo "File NOT open"
else
    echo "File open"
fi
Refer to http://linux.about.com/library/cmd/blcmdl8_lsof.htm
I think an advisory lock will help. If a process tries to use a file that another program is working on, it will get blocked or get an error; it can still access the file by force, but then the result is unpredictable. In order to maintain consistency, all of the processes that want to access the file should obey the advisory-lock rule. I think that will work.
When the file is closed, the lock is freed too, and other processes can try to acquire it.
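The same idea sketched in Java, just to make the flow concrete (the path is hypothetical, and both the recorder and the reader would have to use locking for this to mean anything):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class LockProbe {
    static boolean recorderFinished(String path) throws IOException {
        FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.WRITE);
        try {
            FileLock lock = ch.tryLock();      // null if a cooperating process holds the lock
            if (lock == null) {
                return false;                  // recorder still has it; EOF is not the real end
            }
            lock.release();
            return true;                       // no one holds the lock; EOF can be trusted
        } finally {
            ch.close();
        }
    }
}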

Following multiple log files efficiently

I'm intending to create a programme that can permanently follow a large, dynamic set of log files and copy their entries over to a database for easier near-realtime statistics. The log files are written by diverse daemons and applications, but their format is known, so they can be parsed. Some of the daemons write logs into one file per day, like Apache's cronolog, which creates files like access.20100928. Those files appear with each new day and may disappear when they're gzipped away the next day.
The target platform is an Ubuntu Server, 64 bit.
What would be the best approach to efficiently reading those log files?
I can think of scripting languages like PHP that either open the files themselves and read new data or use system tools like tail -f to follow the logs, or other runtimes like Mono. Bash shell scripts probably aren't well suited for parsing the log lines and inserting them into a database server (MySQL), not to mention easy configuration of my app.
If my programme reads the log files itself, I'd think it should stat() each file once a second or so to get its size and open the file when it has grown. After reading the file (which should hopefully only return complete lines) it could call tell() to get the current position and, next time, seek() directly to the saved position to continue reading. (These are C function names, but I actually wouldn't want to do this in C; Mono/.NET and PHP offer similar functions as well.)
Is that constant stat()ing of the files and subsequent opening and closing a problem? How would tail -f do that? Can I keep the files open and be notified about new data with something like select()? Or does it always return at the end of the file?
In case I'm blocked in some kind of select() or an external tail, I'd need to interrupt that every minute or two to scan for new or deleted files that should (no longer) be followed. Resuming with tail -f then is probably not very reliable; that should work better with my own saved file positions.
Could I use some kind of inotify (file system notification) for that?
If you want to know how tail -f works, why not look at the source? In a nutshell, you don't need to periodically interrupt or constantly stat() to scan for changes to files or directories. That's what inotify does.
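For illustration (sketched in Java here rather than PHP or Mono; the JDK's WatchService sits on top of inotify on Linux, and the directory path is hypothetical), the notification-driven loop looks roughly like this:

import java.nio.file.*;

class LogWatcher {
    static void watch() throws Exception {
        Path logDir = Paths.get("/var/log/myapp");
        WatchService watcher = FileSystems.getDefault().newWatchService();
        logDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                                 StandardWatchEventKinds.ENTRY_MODIFY,
                                 StandardWatchEventKinds.ENTRY_DELETE);
        while (true) {
            WatchKey key = watcher.take();                     // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = logDir.resolve((Path) event.context());
                // on ENTRY_MODIFY: seek to the saved offset of 'changed' and read the new lines
                // on ENTRY_CREATE / ENTRY_DELETE: start or stop following that file
            }
            key.reset();
        }
    }
}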
