Reading a file directly from HDFS into a shell function - file

I have a shell function that is called from inside my map function. The shell function takes two parameters: an input file and an output file. Something like this:
$> unix-binary /pathin/input.txt /pathout/output.txt
The problem is that these input.txt files reside in HDFS, and the output.txt files need to be written back to HDFS. Currently, I first copy the needed file to the local hard drive with fs.copyToLocalFile, call the unix binary, and then write output.txt back to HDFS with fs.copyFromLocalFile.
This approach is not optimal because it involves a substantial amount of redundant reading and writing to the HDD, which slows down performance. So my question is: how can I read the HDFS file directly as input and write the results directly to HDFS?
$> unix-binary hdfs://master:54310/pathin/input.txt hdfs://master:54310/pathout/output.txt
will not work. Is there any way around this? Can I treat an HDFS file as a local file somehow?
I have access to the unix-binary source code written in C. Maybe changing the source code would help?

You can add the file to the DistributedCache and access it from the mapper through the cache. Call your shell function on the local file, write the output file to local disk, and then copy the local file to HDFS.
However, operations such as calling shell functions, or reading/writing from within a mapper/reducer break the MapReduce paradigm. If you find yourself needing to perform such operations, MapReduce may not be the solution you're looking for. HDFS and MapReduce were designed to perform massive scale batch processing on small numbers of extremely large files.
Since you have access to the unix-binary source code, your best option might be to implement the particular function(s) you want in Java. Feed the input files to your mapper and call the function from the mapper on the data, rather than working with files on HDFS/LocalFS.
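If the binary has to stay as it is, one way to avoid the local copies is to stream through pipes instead. The following is only a sketch under several assumptions that are not in the original post: the `hdfs` command-line client is on the PATH of the task node, the system is Linux so that `/dev/stdin` and `/dev/stdout` can stand in for the two file arguments, and the binary reads its input and writes its output sequentially without seeking. The helper names `hdfs_pipeline` and `run_pipeline` are made up for illustration. It still launches external processes from a mapper, so the caveat above about the MapReduce paradigm applies.

```python
import subprocess


def hdfs_pipeline(binary, hdfs_in, hdfs_out):
    """Build the three commands: read from HDFS, run the binary, write to HDFS."""
    return [
        ["hdfs", "dfs", "-cat", hdfs_in],
        # Assumption: the binary accepts these device paths as ordinary files.
        [binary, "/dev/stdin", "/dev/stdout"],
        # "-" tells "hdfs dfs -put" to read the data from standard input.
        ["hdfs", "dfs", "-put", "-", hdfs_out],
    ]


def run_pipeline(commands):
    """Run the commands connected by pipes and return their exit codes."""
    procs = []
    prev_out = None
    for i, argv in enumerate(commands):
        is_last = i == len(commands) - 1
        proc = subprocess.Popen(
            argv,
            stdin=prev_out,
            stdout=None if is_last else subprocess.PIPE,
        )
        if prev_out is not None:
            # Close our copy so the upstream process sees a broken pipe
            # if the downstream process exits early.
            prev_out.close()
        prev_out = proc.stdout
        procs.append(proc)
    return [p.wait() for p in procs]
```

A call such as `run_pipeline(hdfs_pipeline("unix-binary", "/pathin/input.txt", "/pathout/output.txt"))` would return one exit code per stage; any non-zero value means that stage failed.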


Multithreaded compression, random access and on-the-fly reading

I have a program running on Linux which generates thousands of text files. I want these files to be packed into a single (compressed) file.
The compressed file will later be opened by a C program, which needs to access specific files inside that container, in a random fashion.
The whole thing is working as follows:
1. The Linux program generates thousands of small files.
2. zip -9 *
3. A C program using libzip accesses specific files inside the .zip, depending on what the user requests. These reads are done in memory (no decompressed files are written to disk).
Works great. However, the compression takes about 20 minutes to finish. Because it runs on a 40-core server, I have been experimenting with lbzip2, with excellent results in terms of both compression ratio and speed. I have also used zip -0 to pack all the .bz files into a single .zip container, which I assume is a better option than tar because of random access.
So my question is, how can I read .bz files compressed inside a .zip file? As far as I can tell, gzopen takes a file path as first argument.
You could just stick with your current zip format for random access. Run separate zip commands individually on each text file to turn them into many single entry zip files. Launch all those at once, and your 40 cores will be kept busy until done. Once done, use zipmerge to combine them all into a single zip file.
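The "one single-entry zip per file, all at once" step can be sketched as follows. This uses Python's `zipfile` module rather than the `zip` command the answer refers to, so treat it as an illustration of the idea, not the exact procedure. The function names are hypothetical, and threads are used on the assumption that CPython's zlib releases the GIL while compressing; a process pool would be the alternative.

```python
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor


def zip_one(path):
    """Compress one file into its own single-entry archive next to it."""
    archive = path + ".zip"
    with zipfile.ZipFile(
        archive, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as zf:
        # Store only the base name so the merged archive has flat entries.
        zf.write(path, arcname=os.path.basename(path))
    return archive


def zip_all(paths, workers=None):
    """Compress every file concurrently and return the archive paths."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zip_one, paths))
```

The resulting archives can then be combined with zipmerge as described above, giving a single .zip that the existing libzip reader can open unchanged.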

How do I get the disk addresses of files in C/C++?

When a file is saved into a drive, its contents are written & then indexed. I want to get the indexes and to access the raw contents of the files.
Any idea how to do this, especially for ext4 and btrfs?
UPDATE: I want to get the addresses of the extents of a file. The information about the addresses must be stored somewhere on the disk. I want to retrieve this info in order to map the physical location of the file contents. Are there any methods to achieve that?
UPDATE: Hello, all! Thanks for your replies. What I want is a function/command which returns a list of extent addresses. debugfs seems to be the function/command with the most relevant functionality.
It depends on the filesystem you are using. If you are running Linux, you can use debugfs to look up the file in the filesystem.
I have to say that all filesystems are mounted through a VFS, a virtual filesystem that acts as a simplified interface with the standard operations (open, close, read...). What does that mean? Neither a filesystem nor its contents (files, dirs) are opened directly from disk. When you open something, it is moved into main memory (your RAM), you do your operations, and when you close it, it returns to the disk drive.
Now, the question is: can I get the absolute address in a filesystem? Yes, if you open your whole filesystem with something like open("/dev/sdaX", O_RDONLY); you then get the address relative to your filesystem, using lseek in C for example.
And then... can I get the same for the whole drive? No, because you cannot open the whole drive as a file descriptor. Remember /dev/sdaX in UNIX? Partitions can be opened like files because they have a virtual interface running on them.
Your last question: can I read the really raw contents? All files are read as they appear on disk; the only things that change are the descriptor used by the OS and some data about how the file is indexed, all of this acting as a "file header".
I hope all your questions are answered.
The current solution/workaround is to call these programs with popen:
filefrag -e /path/to/file
hdparm --fibmap /path/to/filename
Then one simply parses the string outputs of these programs. It is not a real solution (i.e., one that produces output at the C/C++ level), but I'll accept it for now.
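For a programmatic route, filefrag itself obtains extents through the Linux FIEMAP ioctl, and the same call can be made directly. Below is a minimal sketch in Python (the equivalent C code would use the same ioctl with `struct fiemap` from `<linux/fiemap.h>`). The constants and struct layouts are taken from the kernel headers as I understand them; the function names are made up. It assumes a filesystem that supports FIEMAP, and on btrfs the "physical" values are, as far as I know, addresses in btrfs's own logical address space rather than raw device offsets.

```python
import fcntl
import struct

# Values from <linux/fs.h> and <linux/fiemap.h>.
FS_IOC_FIEMAP = 0xC020660B
FIEMAP_FLAG_SYNC = 0x1
HEADER = struct.Struct("=QQIIII")    # struct fiemap, without the extent array
EXTENT = struct.Struct("=QQQ2QI3I")  # struct fiemap_extent


def build_request(max_extents):
    """Allocate a request buffer covering the whole file."""
    buf = bytearray(HEADER.size + max_extents * EXTENT.size)
    # fm_start=0, fm_length=max (whole file), flags, mapped=0, count, reserved
    HEADER.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF, FIEMAP_FLAG_SYNC,
                     0, max_extents, 0)
    return buf


def parse_extents(buf):
    """Decode (logical, physical, length, flags) tuples, all in bytes."""
    mapped = HEADER.unpack_from(buf, 0)[3]  # fm_mapped_extents
    extents = []
    for i in range(mapped):
        fields = EXTENT.unpack_from(buf, HEADER.size + i * EXTENT.size)
        extents.append((fields[0], fields[1], fields[2], fields[5]))
    return extents


def file_extents(path, max_extents=256):
    """Ask the kernel for the extent map of one file."""
    buf = build_request(max_extents)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)  # kernel fills buf in place
    return parse_extents(buf)
```

Calling `file_extents("/path/to/file")` returns at most `max_extents` entries; a heavily fragmented file would need a larger buffer or repeated calls with a moving start offset.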

Apache Spark: batch processing of files

I have directories and subdirectories set up on HDFS, and I'd like to preprocess all the files before loading them all into memory at once. I basically have big files (1MB) that will be more like 1KB once processed, and then I'd call sc.wholeTextFiles to get started with my analysis.
How do I loop over each file (*.xml) in my directories/subdirectories, do an operation (let's say, for the example's sake, keep the first line), and then dump the result back to HDFS (as a new file, say .xmlr)?
I'd recommend just using sc.wholeTextFiles and preprocessing the files with transformations, then saving them all back as a single compressed sequence file (you can refer to my guide to do so).
Another option might be to write a MapReduce job that processes a whole file at a time and saves the results to the sequence file, as I proposed before. This is the example described in the book 'Hadoop: The Definitive Guide'; take a look at it.
In both cases you would do almost the same thing: both Spark and Hadoop will bring up a single process (a Spark task or a Hadoop mapper) to process these files, so in general both approaches work using the same logic. I'd recommend starting with the Spark one, as it is simpler to implement given that you already have a cluster with Spark.
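The Spark variant could look roughly like this in PySpark. It is a sketch, not the answerer's guide: the helper names are hypothetical, `sc` is assumed to be an existing SparkContext, and the input pattern and output directory are placeholders supplied by the caller. The "keep the first line" operation is the example from the question.

```python
def first_line(content):
    """The example preprocessing step: keep only the first line of a file."""
    lines = content.splitlines()
    return lines[0] if lines else ""


def renamed(path):
    """Map an input name ending in .xml to the .xmlr name from the question."""
    return path[:-4] + ".xmlr" if path.endswith(".xml") else path


def preprocess(sc, input_glob, output_dir):
    """Read whole files, shrink them, and save one compressed sequence file set.

    sc is an existing SparkContext; input_glob and output_dir are
    caller-supplied HDFS paths (hypothetical here).
    """
    (sc.wholeTextFiles(input_glob)  # RDD of (path, full file content)
       .map(lambda kv: (renamed(kv[0]), first_line(kv[1])))
       .saveAsSequenceFile(
           output_dir,
           compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"))
```

The saved pairs keep the renamed path as the key, so a later job can load the small records with `sc.sequenceFile(output_dir)` instead of rereading the original files.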

Managing log file size

I have a program which logs its activity.
I want to implement a log file mechanism to keep the log file under a certain size, let's say 10 MB.
The log file itself just holds commands the program executed; those commands are variable length.
Right now, the program runs in a Windows environment, but I'm likely to port it to UNIX soon.
I've come up with two methods for managing the log files:
1. Keep multiple smaller files, and if the new command would exceed the current file's size limit, truncate the oldest file to zero size and start writing there.
2. Keep a header in the file which holds metadata about the first command in the file and the next place to write to in the file. With this approach, I think each command should also hold metadata about its length.
My questions are as follows:
In terms of efficiency which of these methods would you use, and why?
Is there a unix command / function to do this easily?
Thanks a lot for your help.
On UNIX/Linux platforms there's a logrotate program that manages log files. Details can be found for example here:
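If the rotation has to live inside the program (for example while it still runs on Windows), method 1 from the question is what size-based rotating loggers usually implement. As an illustration only, here is how it looks with Python's standard `logging.handlers.RotatingFileHandler`; this is a different tool from logrotate, the program in the question may well be written in another language, and the function name and defaults below are assumptions.

```python
import logging
import logging.handlers


def make_command_logger(path, max_bytes=2 * 1024 * 1024, backups=4,
                        name="commands"):
    """Return a logger whose files together stay under about 10 MB.

    With the defaults there are at most 5 files (the live one plus 4
    backups) of up to 2 MB each; the oldest backup is dropped on rotation.
    """
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep commands out of the root logger
    logger.addHandler(handler)
    return logger
```

Because a record that would push the file past the limit triggers rotation before it is written, variable-length commands are never split across files, which is the main thing method 2's per-command length metadata would otherwise have to handle.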

Following multiple log files efficiently

I'm intending to create a programme that can permanently follow a large dynamic set of log files to copy their entries over to a database for easier near-realtime statistics. The log files are written by diverse daemons and applications, but the format of them is known so they can be parsed. Some of the daemons write logs into one file per day, like Apache's cronolog that creates files like access.20100928. Those files appear with each new day and may disappear when they're gzipped away the next day.
The target platform is an Ubuntu Server, 64 bit.
What would be the best approach to efficiently reading those log files?
I could think of scripting languages like PHP that either open the files themselves and read new data or use system tools like tail -f to follow the logs, or other runtimes like Mono. Bash shell scripts probably aren't so well suited for parsing the log lines and inserting them into a database server (MySQL), not to mention an easy configuration of my app.
If my programme reads the log files itself, I'd think it should stat() the file about once a second to get its size and open the file when it has grown. After reading the file (which should hopefully only return complete lines) it could call tell() to get the current position and next time directly seek() to the saved position to continue reading. (These are C function names, but actually I wouldn't want to do that in C. And Mono/.NET or PHP offer similar functions as well.)
Is that constant stat()ing of the files and subsequent opening and closing a problem? How would tail -f do that? Can I keep the files open and be notified about new data with something like select()? Or does it always return at the end of the file?
In case I'm blocked in some kind of select() or external tail, I'd need to interrupt that every 1 or 2 minutes to scan for new or deleted files that shall (no longer) be followed. Resuming with tail -f then is probably not very reliable. That should work better with my own saved file positions.
Could I use some kind of inotify (file system notification) for that?
If you want to know how tail -f works, why not look at the source? In a nutshell, you don't need to periodically interrupt or constantly stat() to scan for changes to files or directories. That's what inotify does.
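Whatever wakes the reader up (an inotify event or a timer), the saved-position reading described in the question can be sketched like this. The function name is hypothetical, and the reset-to-zero rule for a shrunken file is an assumption about how truncated or replaced logs should be handled. Only complete lines are returned; a partial last line is left for the next call.

```python
import os


def read_new_lines(path, offset):
    """Return (complete_lines, new_offset) for data appended after offset."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < offset:
            # Assumption: a smaller file means it was truncated or replaced.
            offset = 0
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n")
    if end == -1:
        # No complete line yet; keep the old position.
        return [], offset
    lines = data[:end].split(b"\n")
    return [line.decode("utf-8", "replace") for line in lines], offset + end + 1
```

The caller keeps one offset per followed file and passes it back on the next wake-up, so files that appear or disappear between scans do not disturb the positions of the others.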
