Multithreaded compression, random access and on-the-fly reading - zlib

I have a program running on Linux which generates thousands of text files. I want these files to be packed into a single (compressed) file.
The compressed file will later be opened by a C program, which needs to access specific files inside that container, in a random fashion.
The whole thing works as follows:
1. A Linux program generates thousands of small files.
2. zip -9 out.zip *
3. A C program uses libzip to access specific files inside the .zip, depending on what the user requests. These reads are done in memory (no decompressed files are written to disk).
This works great. However, the compression takes about 20 minutes to finish. Because the compression runs on a 40-core server, I have been experimenting with lbzip2, with excellent results in terms of both compression ratio and speed. I have also used zip -0 to pack all the .bz files into a single .zip container, which I assume is a better option than tar because of random access.
So my question is, how can I read .bz files compressed inside a .zip file? As far as I can tell, gzopen takes a file path as first argument.

You could just stick with your current zip format for random access. Run separate zip commands individually on each text file to turn them into many single entry zip files. Launch all those at once, and your 40 cores will be kept busy until done. Once done, use zipmerge to combine them all into a single zip file.
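As a rough illustration of the "one single-entry archive per file, all in parallel" step, here is a minimal sketch that uses Python's zipfile module as a stand-in for the separate zip commands. The function names zip_one and zip_all are hypothetical, and the final merge into one archive is still left to zipmerge as described above.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def zip_one(src: Path, out_dir: Path) -> Path:
    """Write src into its own single-entry zip and return the archive path."""
    dst = out_dir / (src.name + ".zip")
    with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED, compresslevel=9) as zf:
        zf.write(src, arcname=src.name)
    return dst


def zip_all(sources, out_dir, workers=40):
    """Compress every source file concurrently, one archive per file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(lambda s: zip_one(Path(s), out_dir), sources))
```

Whether threads actually keep all 40 cores busy depends on how much of the time is spent inside the compression library; if they do not, swapping in a process pool is the obvious thing to try.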

Related

Python reading of files

I am new to Python and I am facing my first troubles.
I have to read some .dat files (100), and each file contains a set of 5000 power traces. The total amount of memory taken by the files is almost 10 GB, so I cannot read all the files together because that would fill the RAM. So, using the np.fromfile method with a for loop over every file is not useful.
I would like to use memory mapping, reading just a few files at a time, but I need to handle the data at the same time.
Do you have any suggestions?
Cheers

Safely writing to and reading from the same file with multiple processes on Linux and Mac OS X

I have three processes designed to run constantly in both Linux and Mac OS X environments. One process (the Downloader) downloads and stores a local copy of a large XML file every 30 seconds. Two other processes (the Workers) use the stored XML file for input. Each Worker starts and runs at random times. Since the XML file is big, it takes a long time to download. The Workers also take a long time to read and parse it.
What is the safest way to set up the processes so the Downloader doesn't clobber the stored file while the Workers are trying to read it?
For Linux and Mac OS X machines that use inode-based file systems, use temporary files to store the data while it's being downloaded (and is in an incomplete state). Once the download is complete, move the temporary file into its final location with an atomic action.
For a little more detail, there are two main things to watch out for when one process (e.g. Downloader) writes a file that's actively read by other processes (e.g. Workers):
Make sure the Workers don't try to read the file before the Downloader has finished writing it.
Make sure the Downloader doesn't alter the file while the Workers are reading it.
Using temporary files accommodates both of these points.
For a more specific example, when the Downloader is actively pulling the XML file, have it write to a temporary location (e.g. 'data-storage.tmp') on the same device/disk* where the final file will be stored. Once the file is completely downloaded and written, have the Downloader move it to its final location (e.g. 'data-storage.xml') via an atomic (aka linearizable) rename, for example with the mv command.
* Note that the temporary file needs to be on the same device as the final file location so that the file keeps its inode number and the rename can be done atomically. Across devices, mv has to copy the data and delete the original, which is not atomic.
This methodology ensures that while the file is being downloaded/written the Workers won't see it, since it's in the .tmp location. Because of the way renaming works with inodes, it also makes sure that any Worker that has already opened the file continues to see the old content even if a new version of the data-storage file is put in place.
The Downloader will point 'data-storage.xml' to a new inode number when it does the rename, but the Worker will continue to access 'data-storage.xml' through the previous inode number, thereby continuing to work with the file in that state. At the same time, any Worker that opens 'data-storage.xml' after the Downloader has done the rename will see contents from the new inode number, since that is now what the file system references directly. So, two Workers can be reading from the same filename (data-storage.xml), but each will see a different (and complete) version of the contents of the file, based on which inode the filename pointed to when the file was first opened.
To see this in action, I created a simple set of example scripts on GitHub that demonstrate this functionality. They can also be used to test/verify that the temporary file solution works in your environment.
An important note is that it's the file system on the particular device that matters. If you are using a Linux or Mac machine but working with a FAT file system (for example, a usb thumb drive), this method won't work.
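The write-then-rename pattern described above can be sketched in a few lines. This is only an illustration under assumptions: the function name publish_atomically is hypothetical, and it relies on os.replace performing an atomic rename, which holds on POSIX systems when the temporary file and the final file are on the same file system.

```python
import os
import tempfile


def publish_atomically(data: bytes, final_path: str) -> None:
    """Write data to a temp file beside final_path, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Same directory => same file system, so the rename below is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
    except BaseException:
        # Do not leave a half-written temp file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A Worker simply opens 'data-storage.xml' and reads it; a handle opened before the rename keeps reading the old inode. Note that mkstemp creates the file with owner-only permissions, so Workers running as a different user would need an os.chmod on the temporary file before the rename.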

Managing log file size

I have a program which logs its activity.
I want to implement a log file mechanism to keep the log file under a certain size, let's say 10 MB.
The log file itself just holds commands the program executed; those commands are variable length.
Right now, the program runs in a Windows environment, but I'm likely to port it to UNIX soon.
I've come up with two methods for managing the log files:
1. Keep multiple files of lower size, and if the new command exceeds the current file length, truncate the oldest file to zero size, and start writing there.
2. Keep a header in the file, which holds metadata regarding the first command in the file, and the next place to write to in the file. Also, I think each command should hold metadata about its length this way.
My questions are as follows:
In terms of efficiency which of these methods would you use, and why?
Is there a UNIX command / function to do this easily?
Thanks a lot for your help,
Nihil.
On UNIX/Linux platforms there's a logrotate program that manages logfiles. Details can be found for example here:
http://linuxcommand.org/man_pages/logrotate8.html
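logrotate works from outside the program. For comparison, the first method from the question (several smaller files, reusing the oldest) can also be done in-process. The sketch below uses Python's standard RotatingFileHandler purely as an illustration of that idea; the function name make_command_logger and the 2 MB x 5 files split are assumptions chosen to stay within the 10 MB budget, not something the original program uses.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_command_logger(path, max_bytes=2 * 1024 * 1024, backups=4):
    """Return a logger that rolls over to a new file before exceeding max_bytes."""
    # Disk usage is bounded by roughly max_bytes * (backups + 1).
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger = logging.getLogger(f"commands:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep command records out of the root logger
    logger.addHandler(handler)
    return logger
```

Each call to logger.info("some command") appends one record; when the next record would push the current file past max_bytes, the handler renames the existing files (path.1, path.2, ...) and drops the oldest one.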

Reading a file directly from HDFS into a shell function

I have a shell function that is called from inside my map function. The shell function takes 2 parameters -> an input file and an output file. Something like this
$> unix-binary /pathin/input.txt /pathout/output.txt
The problem is, that these input.txt files reside in HDFS and the output.txt files need to be written back to HDFS. Currently, I first copy the needed file with fs.copyToLocalFile into the local hard drive, call the unix binary and then write the output.txt back to HDFS with fs.copyFromLocalFile.
The problem with this approach is that it is not optimal: it involves a substantial amount of redundant reading and writing to the HDD, which slows down performance. So, my question is, how can I read the HDFS file directly as input and write the results directly to HDFS?
Obviously,
$>unix-binary hdfs://master:53410/pathin/input.txt hdfs://master:54310/pathout/output.txt
will not work. Is there any way around this? Can I treat an HDFS file as a local file somehow?
I have access to the unix-binary source code written in C. Maybe changing the source code would help?
thanks
You can add the file to the DistributedCache and access it from the mapper from the cache. Call your shell function on the local file and write the output file to local disk and then copy the local file to HDFS.
However, operations such as calling shell functions, or reading/writing files from within a mapper/reducer, break the MapReduce paradigm. If you find yourself needing to perform such operations, MapReduce may not be the solution you're looking for. HDFS and MapReduce were designed to perform massive-scale batch processing on small numbers of extremely large files.
Since you have access to the unix-binary source code, your best option might be to implement the particular function(s) you want in Java. Feed the input files to your mapper and call the function from the mapper on the data rather than working with files on HDFS/LocalFS.

Following multiple log files efficiently

I'm intending to create a programme that can permanently follow a large dynamic set of log files to copy their entries over to a database for easier near-realtime statistics. The log files are written by diverse daemons and applications, but the format of them is known so they can be parsed. Some of the daemons write logs into one file per day, like Apache's cronolog that creates files like access.20100928. Those files appear with each new day and may disappear when they're gzipped away the next day.
The target platform is an Ubuntu Server, 64 bit.
What would be the best approach to efficiently reading those log files?
I could think of scripting languages like PHP that either open the files themselves and read new data or use system tools like tail -f to follow the logs, or other runtimes like Mono. Bash shell scripts probably aren't so well suited for parsing the log lines and inserting them into a database server (MySQL), not to mention an easy configuration of my app.
If my programme reads the log files itself, I'd think it should stat() the file once a second or so to get its size and open the file when it's grown. After reading the file (which should hopefully only return complete lines) it could call tell() to get the current position and next time directly seek() to the saved position to continue reading. (These are C function names, but actually I wouldn't want to do that in C. And Mono/.NET or PHP offer similar functions as well.)
Is that constant stat()ing of the files and subsequent opening and closing a problem? How would tail -f do that? Can I keep the files open and be notified about new data with something like select()? Or does it always return at the end of the file?
In case I'm blocked in some kind of select() or external tail, I'd need to interrupt that every 1 or 2 minutes to scan for new or deleted files that shall (no longer) be followed. Resuming with tail -f then is probably not very reliable. That should work better with my own saved file positions.
Could I use some kind of inotify (file system notification) for that?
If you want to know how tail -f works, why not look at the source? In a nutshell, you don't need to periodically interrupt or constantly stat() to scan for changes to files or directories. That's what inotify does.
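Whichever mechanism signals that a file has changed (inotify or a periodic check), the reader still needs the saved-position logic the question describes. A minimal sketch of that part, assuming newline-terminated log entries; the function name read_new_lines is hypothetical:

```python
import os


def read_new_lines(path, offset):
    """Return (complete_lines, new_offset) for data appended after offset."""
    size = os.stat(path).st_size
    if size < offset:
        offset = 0  # file was truncated or replaced; start over
    if size == offset:
        return [], offset
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(size - offset)
    # Keep only whole lines; a trailing partial line is left for the next call.
    end = chunk.rfind(b"\n") + 1
    text = chunk[:end].decode("utf-8", errors="replace")
    lines = text.split("\n")[:-1]  # drop the empty piece after the last newline
    return lines, offset + end
```

The caller stores the returned offset per file and passes it back in on the next change notification, so a restart or a rescan of the directory does not re-read or lose entries.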
