Hadoop: Sending files or file paths to a map-reduce job

Suppose I have N files to process using Hadoop map-reduce. Let's assume they are large, well beyond the block size, and that there are only a few hundred of them. Now I would like to process each of these files; let's assume the word-counting example.
My question is: What is the difference between creating a map-reduce job whose input is a text file with the paths to each of the files, as opposed to sending each of the files directly to the map function, i.e. concatenating all the files and pushing them into different mappers [EDIT]?
Are these both valid approaches?
Are there any drawbacks to them?
Thanks for the prompt answers. I've included a detailed description of my problem, since my abstraction may have missed a few important points:
I have N small files on HDFS in my application and I just need to process each file. So I am using a map function to apply a Python script to each file (actually an image; I've already looked at all the Hadoop image-processing links out there). I am aware of the small-files problem, and the typical recommendation is to group the smaller files so we avoid the overhead of moving files around (the basic recommendation being sequence files or creating your own data structures, as in the case of HIPI).
This makes me wonder: can't we just tell each mapper to look for files that are local to it and operate on those?
I haven't found a solution to that issue, which is why I was looking at either sending each mapper a path to the files or the file itself.
Creating a list of path names for each collection of images seems to be OK, but as mentioned in the comments, I lose the data-locality property.
Now when I looked at the Hadoop streaming interface, it mentions that the different pieces communicate via stdin and stdout, typically used for text files. That's where I get confused: if I am just sending a list of path names this shouldn't be an issue, since each mapper would just try to find the collection of images it is assigned. But when I look at the word-count example, the input is the file itself, which then gets split up across the mappers. So that's where I was confused as to whether I should concatenate images into groups and then send these concatenated groups, just like the text document, to the different mappers, or whether I should instead concatenate the images, leave them in HDFS, and just pass their path to the mapper... I hope this makes sense... maybe I'm completely off here...
Thanks again!

Both are valid. But the latter would incur extra overhead and performance would go down, because you are talking about concatenating all the files into one and feeding it to just one mapper. By doing that you would go against one of the most basic principles of Hadoop: parallelism. Parallelism is what makes Hadoop so efficient.
FYI, if you really need to do that, you have to set isSplitable to false in your InputFormat class, otherwise the framework will split the file (based on your InputFormat).
And as far as the input path is concerned, you just need to give the path of the input directory. Each file inside this directory will be processed without human intervention.
HTH
In response to your edit:
I think you have misunderstood this a bit. You don't have to worry about localization; Hadoop takes care of that. You just have to run your job and the data will be processed on the node where it is present. The size of the file has nothing to do with it. You don't have to tell the mappers anything. The process goes like this:
You submit your job to the JobTracker (JT). The JT directs the TaskTracker (TT) running on the node which has the block of data required by the job to start the mappers. If the slots there are occupied by some other process, the same thing takes place on some other node that has the data block.

The bottleneck will be there if you are processing the whole concatenated file in a single mapper, as you have mentioned.
It won't be a problem if you are providing the concatenated file as input to Hadoop, since the large file formed will obviously be distributed in HDFS (I assume you are using HDFS) and will be processed by multiple mappers and reducers concurrently.

My question is: What is the difference between creating a map-reduce job whose input is a text file with the paths to each of the files, as opposed to sending each of the files directly to the map function, i.e. concatenating all the files and pushing them into a single mapper?
By listing the file paths in a text file and (I assume) manually opening them in the mapper, you'll be defeating data locality (that is, where Hadoop will try to run your mapper code where the data is, rather than moving the data to where your code executes). With 1000 files, this will also probably be processed by a single mapper instance (I imagine 1000 lines of text should be less than your block size).
If you concatenate all the files first and then use the result as input, this will usually be less efficient, mainly because you're copying all the files to a single node (to concatenate them) and then pushing the data back up to HDFS as a single file. This is even before you get to process the file again in a mapper (or more mappers, depending on the splittability of your input format and compression codec).
If you were going to process this concatenated file multiple times, and each file were smaller than the block size, then merging them into a single file might be beneficial, but you've already noted that each file is larger than the default block size.
Is there a particular reason you want all files to flow through a single mapper (which is what it sounds like you are trying to achieve with these two options)?

Related

Implementing a database in a single file

This question is about creating a new single file database format. I am new to this!
I wonder how SQLite does this: for databases larger than the available memory, SQLite must be reading from certain parts of the file somehow, i.e. reading at position n?
Is this possible with sub-linear runtime complexity? I assume that when SQLite fetches a particular row, it uses an O(log n) index lookup first (so it doesn't fetch the entire index) and then fetches the row from a particular location in the file. All of this involves not reading the whole file into memory, but FS methods appear not to provide this functionality.
Is fs.skip(n) [pseudocode] done in O(n), or does the OS skip straight to position n? Theoretically this should be possible, because in the OS files are divided into blocks, and inodes reference 1-3 levels of array-like structures that locate the blocks, so fetching a particular block in a file should be possible in sub-linear time, without reading in the entire file.
I wonder how SQLite does this: for databases larger than the available memory, SQLite must be reading from certain parts of the file somehow, i.e. reading at position n?
Yes. Almost every programming language has documentation that explains how to position a read within a file.
All of this involves not reading the whole file into memory, but FS methods appear not to provide this functionality.
Every file-system access API that I know of supports this, and it is explained in the documentation. Examples range from memory-mapped files on Windows (which are quite advanced and not an option if you plan to go OS-agnostic), down to something as simple as the fseek() function in C, which repositions a file stream.
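As an illustration only (not from the question), here is a minimal C sketch that reads a handful of bytes at an arbitrary position without scanning the rest of the file; the file name and offset are made up:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Hypothetical database file and offset, purely for illustration. */
        const char *path = "example.db";
        long offset = 4096;                 /* byte position to read from */
        char buf[64];

        FILE *fp = fopen(path, "rb");
        if (fp == NULL) {
            perror("fopen");
            return EXIT_FAILURE;
        }

        /* Reposition the stream; the OS seeks directly rather than scanning the file. */
        if (fseek(fp, offset, SEEK_SET) != 0) {
            perror("fseek");
            fclose(fp);
            return EXIT_FAILURE;
        }

        size_t n = fread(buf, 1, sizeof buf, fp);
        printf("read %zu bytes at offset %ld\n", n, offset);

        fclose(fp);
        return EXIT_SUCCESS;
    }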
I suggest brushing up on your knowledge of file-system access methods in your programming language of choice.

Strategy for mass storage of small files

What is a good strategy for mass storage of millions of small files (~50 KB on average) with auto-pruning of files older than 20 minutes? I need to write and access them from the web server.
I am currently using ext4, and during deletes (scheduled in cron) HDD usage spikes up to 100%, with [flush-8:0] showing up as the process that creates the load. This load interferes with other applications on the server. When there are no deletes, maximum HDD utilisation is 0-5%. The situation is the same with nested and non-nested directory structures. The worst part is that mass removal during peak load seems to be slower than the rate of insertion, so the number of files that need to be removed grows larger and larger.
I have tried changing schedulers (deadline, cfq, noop); it didn't help. I have also tried setting ionice on the removal script, but it didn't help either.
I have tried GridFS with MongoDB 2.4.3 and it performs nicely, but horribly during mass deletes of old files. I have tried running MongoDB with journaling turned off (nojournal) and without write confirmation for both deletes and inserts (w=0), and it didn't help. It only works fast and smoothly when there are no deletes going on.
I have also tried storing the data in MySQL 5.5, in a BLOB column in an InnoDB table, with innodb_buffer_pool_size=2GB, innodb_log_file_size=1GB and innodb_flush_log_at_trx_commit=2, but the performance was worse; HDD load was always at 80%-100% (expected, but I had to try). The table only used a BLOB column, a DATETIME column and a CHAR(32) latin1_bin UUID, with indexes on the UUID and DATETIME columns, so there was no room for optimization and all queries were using indexes.
I have looked into the pdflush settings (the Linux flush process that creates the load during mass removal), but changing the values didn't help anything, so I reverted to the defaults.
It doesn't matter how often I run the auto-pruning script: every 1 second, every 1 minute, every 5 minutes, every 30 minutes, it disrupts the server significantly either way.
I have tried storing the inode value and, when removing, deleting old files sequentially by sorting them by their inode numbers first, but it didn't help.
Using CentOS 6. HDD is SSD RAID 1.
What would be a good and sensible solution to my task that will solve the auto-pruning performance problem?
Deletions are kind of a performance nuisance because both the data and the metadata need to get destroyed on disk.
Do they really need to be separate files? Do the old files really need to get deleted, or is it OK if they get overwritten?
If the answer is "no" to the second of these questions, try this:
Keep a list of files that's roughly sorted by age. Maybe chunk it by file size.
When you want to write to a new file, find an old file that's preferably bigger than what you're replacing it with. Instead of blowing away the old file, truncate() it to the appropriate length and then overwrite its contents. Make sure you update your old-files list.
Once in a while, clean up the really old stuff that hasn't been replaced.
It might be advantageous to have an index into these files. Try using a tmpfs full of symbolic links to the real file system.
You might or might not get a performance advantage in this scheme by chunking the files into manageably-sized subdirectories.
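A rough C sketch of the truncate-and-overwrite step described above; the path, payload and error handling are only illustrative:

    #include <fcntl.h>
    #include <unistd.h>

    /* Reuse an existing "old" file instead of unlinking it: truncate it to the
     * new length, then overwrite its contents in place. The path and payload
     * below are made-up examples. */
    static int reuse_file(const char *path, const void *data, size_t len)
    {
        int fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;

        if (ftruncate(fd, (off_t)len) != 0) {           /* shrink/grow to the new size */
            close(fd);
            return -1;
        }
        if (pwrite(fd, data, len, 0) != (ssize_t)len) { /* overwrite from offset 0 */
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }

    int main(void)
    {
        const char payload[] = "replacement contents";
        return reuse_file("/srv/cache/oldest-entry", payload, sizeof payload - 1) ? 1 : 0;
    }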
If you're OK with multiple things being in the same file:
Keep files of similar sizes together by storing each one as an offset into an array of similarly-sized files. If every file is 32k or 64k, keep a file full of 32k chunks and a file full of 64k chunks. If files are of arbitrary sizes, round up to the next power of two.
You can do lazy deletes here by keeping track of how stale each file is. If you're trying to write and something's stale, overwrite it instead of appending to the end of the file.
Another thought: Do you get a performance advantage by truncate()ing all of the files to length 0 in inode order and then unlink()ing them? Ignorance stops me from knowing whether this can actually help, but it seems like it would keep the data zeroing together and the metadata writing similarly together.
Yet another thought: XFS has a weaker write ordering model than ext4 with data=ordered. Is it fast enough on XFS?
If mass-removing millions of files results in a performance problem, you can resolve it by "removing" all the files at once: instead of using any per-file filesystem operation (like "remove" or "truncate"), you can just create a new (empty) filesystem in place of the old one.
To implement this idea you need to split your drive into two (or more) partitions. After one partition is full (or after 20 minutes) you start writing to the second partition, while using the first one for reading only. After another 20 minutes you unmount the first partition, create an empty filesystem on it, mount it again, and then start writing to the first partition while using the second one for reading only.
The simplest solution is to use just two partitions. But this way you don't use disk space very efficiently: you can store only half as many files on the same drive. With more partitions you can increase space efficiency.
If for some reason you need all your files in one place, use tmpfs to store links to the files on each partition. This requires mass-removing millions of links from tmpfs, but it alleviates the performance problem because only links have to be removed, not file contents; also, those links only have to be removed from RAM, not from the SSD.
If you don't need to append to the small files, I would suggest that you create a big file and do a sequential write of the small files right in it, while keeping records of offsets and sizes of all the small files within that big file.
As you reach the end of the big file, start writing from its beginning again, while invalidating records of the small files in the beginning that you replace with new data.
If you choose the big file's size properly, based on the rate at which new files are saved, you can get auto-pruning of files older than ~20 minutes almost exactly as you need.
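As a hedged sketch of that idea in C: the file name, size and the record bookkeeping are all hypothetical, and wrapping simply overwrites the oldest data, which is what gives the ~20-minute pruning for free:

    #include <stdio.h>

    #define BIG_FILE      "store.bin"                 /* hypothetical backing file */
    #define BIG_FILE_SIZE (1024L * 1024 * 1024)       /* sized for ~20 minutes of data */

    static long write_pos = 0;                        /* next write position, wraps around */

    /* Append one small file's contents into the big file, wrapping to the start
     * when the end is reached. Old data near the start simply gets overwritten,
     * which is what provides the auto-pruning. The caller keeps a small
     * in-memory table of {offset, size, timestamp} per stored blob. */
    static long store_blob(FILE *big, const void *data, long size)
    {
        if (write_pos + size > BIG_FILE_SIZE)
            write_pos = 0;                            /* wrap: oldest data is overwritten */

        long at = write_pos;
        fseek(big, at, SEEK_SET);
        fwrite(data, 1, (size_t)size, big);
        write_pos += size;
        return at;
    }

    int main(void)
    {
        FILE *big = fopen(BIG_FILE, "r+b");           /* reuse the file if it exists */
        if (!big) big = fopen(BIG_FILE, "w+b");       /* otherwise create it */
        if (!big) return 1;

        const char blob[] = "image bytes would go here";
        long off = store_blob(big, blob, (long)sizeof blob);
        printf("stored %ld bytes at offset %ld\n", (long)sizeof blob, off);

        fclose(big);
        return 0;
    }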

Is there any way to open a file using a diff without patching the original?

For example, I have a 40 MB file, and I want to make some minor changes to it, maybe 20 KB of changes.
I can create a diff between the resulting file and the original simply enough, either by writing it manually with the application that is making the change, or by taking both the original file and the resulting file and generating the diff from that (using Rabin's polynomial fingerprinting algorithm, for example)...
The issue is that, in order to read the effective outcome of that diff (the new file), I have to apply the diff to the original, create the resulting new file, and read that... this creates two 40 MB files with only 20 KB of difference between them. It seems logical that one could use the initial file combined with the diff and parse (for reading, anyway) the resulting final file without having to create a whole new copy of it.
I have looked through xdiff and it has the functions to create a diff given 2 files, or to apply a diff as a patch to a file, but none to get a simple file handle when provided with the original file and a diff file.
Does such a thing exist? It would be tremendously helpful for storage space savings on larger files, even if only for read-only (write operations could write to a new diff, possibly).
Examples in any language would be fine, although C, Python or PHP would be great if readily available.
Using TortoiseMerge to view diffs:
You could use TortoiseMerge to view the diff without creating a patch. If that doesn't suit you, there are plenty of alternative diff tools.
Further Consideration:
Depending on how often you are making changes and your interest in file size savings you may want to consider using a version control system (perhaps you do already). Common options include SVN, Git, and Mercurial.
What you are describing is source-code control with delta storage: you store many versions of a file, only deltas are saved, and then you can request entire files, which are recomposed on the fly; so you can choose to access them directly (for example with the appropriate library), or save them locally before access.
Look at Subversion, Git, Mercurial and so on, and how they implement their delta storage, and you'll have working examples. Git has a maintenance task that does this internally, using delta storage when it considers it profitable. Git is programmed in C.
That will give you a sample of how to access this kind of file sequentially. Once you've got that, composing patches is relatively simple, and if the patch command list can be accessed efficiently you can also build a random-access solution (as long as the literal parts of the patch and the original are accessible).
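To make the random-access idea concrete, here is a hedged C sketch. It assumes the diff has already been parsed into an in-memory list of COPY (range of the original) and INSERT (literal bytes) commands, which is not something xdiff hands you directly; given that, a byte range of the "virtual" patched file can be served without materialising it:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical in-memory form of a parsed delta: the patched file is the
     * concatenation of these commands, in order. */
    enum cmd_kind { CMD_COPY, CMD_INSERT };

    struct cmd {
        enum cmd_kind kind;
        long          len;      /* bytes this command contributes          */
        long          src_off;  /* CMD_COPY: offset in the original file   */
        const char   *lit;      /* CMD_INSERT: literal bytes from the diff */
    };

    /* Read `len` bytes starting at `pos` of the virtual patched file into `out`,
     * using only the untouched original file plus the command list. */
    static long patched_read(FILE *orig, const struct cmd *cmds, int ncmds,
                             long pos, char *out, long len)
    {
        long produced = 0, base = 0;

        for (int i = 0; i < ncmds && produced < len; i++) {
            long cmd_end = base + cmds[i].len;
            if (pos < cmd_end) {                    /* this command overlaps the request */
                long skip = pos - base;             /* offset inside the command         */
                long take = cmds[i].len - skip;
                if (take > len - produced)
                    take = len - produced;

                long got;
                if (cmds[i].kind == CMD_COPY) {
                    fseek(orig, cmds[i].src_off + skip, SEEK_SET);
                    got = (long)fread(out + produced, 1, (size_t)take, orig);
                } else {
                    memcpy(out + produced, cmds[i].lit + skip, (size_t)take);
                    got = take;
                }
                produced += got;
                pos      += got;
                if (got < take)
                    break;                          /* original shorter than the delta claims */
            }
            base = cmd_end;
        }
        return produced;
    }

    int main(void)
    {
        FILE *orig = fopen("original.bin", "rb");   /* hypothetical original file */
        if (orig == NULL)
            return 1;

        struct cmd cmds[] = {
            { CMD_COPY,   1024, 0,    NULL    },    /* first 1 KB unchanged          */
            { CMD_INSERT, 5,    0,    "PATCH" },    /* 5 literal bytes from the diff */
            { CMD_COPY,   2048, 1024, NULL    },    /* next 2 KB of the original     */
        };

        char buf[16];
        long n = patched_read(orig, cmds, 3, 1020L, buf, (long)sizeof buf);
        printf("read %ld bytes spanning the copy/insert boundary\n", n);

        fclose(orig);
        return 0;
    }

A binary search over the cumulative command offsets would make lookups O(log n) instead of the linear scan shown here.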

Temporary File in C

I am writing a program which outputs a file. This file has two parts. The second part, however, is computed before the first. I was thinking of creating a temporary file and writing the data to it, then creating the permanent file, dumping the temp file's content into the permanent one, and deleting the temp file. I saw some posts saying that this does not work and that it might cause problems across different compilers or platforms.
The data is a bunch of chars. Every 32 chars have to appear on a different line. I can store it in a linked list or something, but I do not want to have to write a linked list for that.
Does anyone have any suggestions or alternative methods?
A temporary file can be created, and although some people say they have had problems with this, I personally have used them with no issues. Using the platform functions to obtain a temporary file is the best option. Don't assume you can write to C:\ etc. on Windows, as this isn't always possible. Don't assume a filename, in case the file is already in use. Not using temporary files correctly is what causes people problems, rather than temporary files being bad.
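For the standard-library route, here is a minimal sketch using tmpfile(), which the C library places appropriately and removes automatically when the stream is closed; the output file name and strings are just placeholders:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Compute and stash the second part first. */
        FILE *tmp = tmpfile();                 /* anonymous temp file, auto-deleted */
        if (tmp == NULL) {
            perror("tmpfile");
            return EXIT_FAILURE;
        }
        fputs("...second part of the output...\n", tmp);

        /* Now write the final file: first part, then copy the temp contents over. */
        FILE *out = fopen("output.txt", "w");  /* hypothetical output name */
        if (out == NULL) {
            perror("fopen");
            return EXIT_FAILURE;
        }
        fputs("...first part of the output...\n", out);

        rewind(tmp);
        int c;
        while ((c = fgetc(tmp)) != EOF)
            fputc(c, out);

        fclose(out);
        fclose(tmp);                           /* the temp file disappears here */
        return EXIT_SUCCESS;
    }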
Is there any reason you cannot just keep the second part in RAM until you are ready for the first? Otherwise, can you work out the size needed for the first part and leave that section of the file blank to come back and fill in later on? This would eliminate the need for the temporary file.
Both solutions you propose could work. You can output intermediate results to a temporary file, and then later append that file to the file that contains the dataset that you want to present first. You could also store your intermediate data in memory. The right data structure depends on how you want to organize the data.
As one of the other answerers notes, files are inherently platform specific. If your code will only run on a single platform, then this is less of a concern. If you need to support multiple platforms, then you may need to special case some or all of those platforms, if you go with the temporary file solution. Whether this is a deal-breaker for you depends on how much complexity this adds compared to structuring and storing your data in memory.

One large file or multiple small files?

I have an application (currently written in Python as we iron out the specifics but eventually it will be written in C) that makes use of individual records stored in plain text files. We can't use a database and new records will need to be manually added regularly.
My question is this: would it be faster to have a single file (500k-1Mb) and have my application open it, loop through it, find the record, and close the file, OR would it be faster to have the records separated and named using some appropriate convention so that the application could simply loop over filenames to find the data it needs?
I know my question is quite general, so pointers to any good articles on the topic are appreciated as much as suggestions.
Thanks very much in advance for your time,
Dan
Essentially your second approach is an index; it's just that you're building your index in the filesystem itself. There's nothing inherently wrong with this, and as long as you arrange things so that you don't get too many files in one directory, it will be plenty fast.
You can achieve the "don't put too many files in the one directory" goal by using multiple levels of directories - for example, the record with key FOOBAR might be stored in data/F/FO/FOOBAR rather than just data/FOOBAR.
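A small C sketch of that fan-out scheme; the data/ root and the one-plus-two character split are arbitrary choices for illustration:

    #include <stdio.h>

    /* Build "data/F/FO/FOOBAR"-style paths from a record key so that no single
     * directory accumulates too many entries. The root and the split depth are
     * arbitrary choices. */
    static void key_to_path(const char *key, char *out, size_t outlen)
    {
        snprintf(out, outlen, "data/%.1s/%.2s/%s", key, key, key);
    }

    int main(void)
    {
        char path[256];
        key_to_path("FOOBAR", path, sizeof path);
        puts(path);    /* prints: data/F/FO/FOOBAR */
        return 0;
    }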
Alternatively, you can make the single large file perform just as well by building an index file that contains a (sorted) list of key-offset pairs. Where the directories-as-index approach falls down is when you want to search on a key different from the one you used to create the filenames; if you've used an index file, you can just create a second index for this situation.
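And a hedged sketch of the index-file alternative: a sorted array of fixed-size {key, offset, length} entries is binary-searched, then the record is read straight out of the data file. All names and the record layout are made up:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Fixed-size index entry, kept sorted by key so it can be binary-searched.
     * The 32-byte key and the file names are illustrative only. */
    struct index_entry {
        char key[32];
        long offset;    /* byte position of the record in the data file */
        long length;    /* record length in bytes */
    };

    static int cmp_entry(const void *key, const void *elem)
    {
        return strncmp((const char *)key,
                       ((const struct index_entry *)elem)->key, 32);
    }

    /* Binary-search an in-memory copy of the index, then seek straight to the record. */
    static long find_record(FILE *data, const struct index_entry *idx, size_t n,
                            const char *key, char *buf, size_t buflen)
    {
        const struct index_entry *e = bsearch(key, idx, n, sizeof *idx, cmp_entry);
        if (e == NULL)
            return -1;

        fseek(data, e->offset, SEEK_SET);
        long want = e->length < (long)buflen ? e->length : (long)buflen;
        return (long)fread(buf, 1, (size_t)want, data);
    }

    int main(void)
    {
        /* Toy demo: one record stored at offset 0 of records.dat. */
        FILE *data = fopen("records.dat", "w+b");
        if (data == NULL)
            return 1;
        fputs("hello record", data);

        struct index_entry idx[1] = { { "FOOBAR", 0, 12 } };
        char buf[64];
        long n = find_record(data, idx, 1, "FOOBAR", buf, sizeof buf);
        if (n > 0)
            printf("%ld bytes: %.*s\n", n, (int)n, buf);
        fclose(data);
        return 0;
    }

Loading and maintaining the sorted index (and the second index mentioned above) is left out of the sketch.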
You may want to reconsider the "we can't use a database" restriction, since you are effectively just building your own database anyway.
Reading a directory is in general more costly than reading a file. But if you can find the file you want without reading the directory (i.e. not "loop over filenames" but "construct a file name") thanks to your naming convention, it may be beneficial to split your database.
Given that your data is 1 MB, I would even consider storing it entirely in memory.
To give you some clue about your question, I'd note that having one single big file means that your application is doing the management of the records, while having multiple small files relies on the system and the filesystem to manage the data. The latter can be quite slow, though, because it involves system calls for all your operations.
Opening and closing files in C takes time. For example, if you have 500 files of 2 KB each and you process them all, 1000 additional operations are added to your application (500 file opens and 500 closes), while having one file of 1 MB saves you those 1000 additional operations... (that is purely my personal opinion...)
Generally it's better to have multiple small files. It keeps memory usage low, and performance is much better when searching through them.
But it depends on the number of operations you'll need, because filesystem calls are much more expensive than memory storage, for instance.
This all depends on your file system, block size and memory cache among others.
As usual, measure and find out if this is a real problem since premature optimization should be avoided. It may be that using one file vs many small files does not matter much for performance in practice and that the choice should be based on clarity and maintainability instead.
(What I can say for certain is that you should not resort to a linear file search; use a naming convention to pinpoint the file in O(1) time instead.)
The general trade-off is that having one big file can be more difficult to update, but having lots of little files is fiddly. My suggestion would be that if you use multiple files and you end up having a lot of them, it can get very slow to traverse a directory with a million files in it. If possible, break the files into some sort of grouping so they can be put into separate directories and "keyed". I have an application that requires the creation of lots of little PDF documents for all users of the system. If we put them in one directory it would be a nightmare, but having a directory per user ID makes it much more manageable.
Why can't you use a DB, I'm curious? I respect your preference, but just want to make sure it's for the right reason.
Not all DBs require a server to connect to or complex deployment. SQLite, for instance, can be easily embedded in your application. Python already has it built-in, and it's very easy to connect with C code (SQLite itself is written in C and its primary API is for C). SQLite manages a feature-complete DB in a single file on the disk, where you can create multiple tables and use all the other nice features of a DB.
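For instance, here is a minimal sketch of embedding SQLite through its C API; the database and table names are invented:

    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        /* The whole database lives in this single file. */
        if (sqlite3_open("records.db", &db) != SQLITE_OK) {
            fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return 1;
        }

        const char *sql =
            "CREATE TABLE IF NOT EXISTS records(key TEXT PRIMARY KEY, body TEXT);"
            "INSERT OR REPLACE INTO records VALUES('FOOBAR', 'some record text');";

        if (sqlite3_exec(db, sql, NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "exec failed: %s\n", err);
            sqlite3_free(err);
        }

        sqlite3_close(db);
        return 0;
    }

Link with -lsqlite3, or compile the sqlite3.c amalgamation directly into the program, which is the usual embedding route.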
