Where should Map put temporary files when running under Hadoop


I am running Hadoop 0.20.1 under SLES 10 (SUSE).
My Map task takes a file and generates a few more; I then generate my results from these files. I would like to know where I should place these files so that performance is good and there are no collisions. It would be nice if Hadoop could delete the directory automatically.
Right now, I am using the temp folder and the task id to create a unique folder, and then working within subfolders of that folder.
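import java.io.File;
// job is assumed to be the task's JobConf; REDUCE_WORK_FOLDER is a constant defined elsewhere
String reduceTaskId = job.get("mapred.task.id");
String reduceTempDir = job.get("mapred.temp.dir");
// The folder name already ends with a separator, so none is added before the subfolder
String myTemporaryFoldername = reduceTempDir + File.separator + reduceTaskId + File.separator;
File diseaseParent = new File(myTemporaryFoldername + REDUCE_WORK_FOLDER);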
The problem with this approach is that I am not sure it is optimal. I also have to delete each new folder myself, or I start to run out of space.
Thanks
akintayo
(edit)
I found that the best place to keep files that you don't want beyond the life of the map task is the directory returned by job.get("job.local.dir"), which is deleted when the map task finishes. I am not sure whether the delete is done on a per-key basis or per tasktracker.
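A minimal sketch of the per-task scratch folder idea, using only the Java standard library. The class and method names are hypothetical, and the base directory and task id are passed in as plain strings; in a real job they would presumably come from job.get("job.local.dir") and job.get("mapred.task.id"). The explicit delete is only needed when the base directory is not cleaned up for you.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: one scratch directory per task attempt. */
class TaskScratchDir {

    /** Creates baseDir/taskId/workFolder (including parents) and returns it. */
    static Path create(String baseDir, String taskId, String workFolder) {
        try {
            // Paths.get joins the segments with exactly one separator each.
            return Files.createDirectories(Paths.get(baseDir, taskId, workFolder));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deletes a directory tree, children before parents. */
    static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse path order lists every child before its parent.
            List<Path> deepestFirst =
                    walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path p : deepestFirst) {
                Files.delete(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for the values a real task would read from its configuration.
        String baseDir = Files.createTempDirectory("job-local").toString();
        String taskId = "attempt_0001_m_000000_0";

        Path work = create(baseDir, taskId, "work");
        try {
            Files.write(work.resolve("intermediate.txt"),
                    "data".getBytes(StandardCharsets.UTF_8));
        } finally {
            // Clean up even if the task body throws.
            deleteRecursively(Paths.get(baseDir, taskId));
        }
        System.out.println("scratch dir still exists: " + Files.exists(work));
    }
}
```

Wrapping the work in try/finally keeps the cleanup in one place, so a failed task does not leave folders behind.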

The problem with that approach is that the sort and shuffle are going to move your data away from where it was localized.
I do not know much about your data, but the distributed cache might work well for you.
${mapred.local.dir}/taskTracker/archive/: This directory holds the localized distributed cache, so the localized distributed cache is shared among all tasks and jobs.
http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
"It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back."

Related

How to deal with issues when storing uploaded files in the file system for a web app?

I am building a web application where the users can create reports and then upload some images for the created reports. Those images will be rendered in the browser when the user clicks a button on the report page. The images are confidential and only authorized users will be able to access them.
I am aware of the pros and cons of storing images in a database, in the filesystem, or in a service like Amazon S3. For my application, I am inclined to keep the images in the filesystem and the paths of the images in the database. That means I have to deal with the problems arising around distributed transaction management. I need some advice on how to deal with these problems.
1- I believe one of the proper solutions is to use technologies like JTA and XADisk. I am not very knowledgeable about these technologies, but I believe 2 phase commit is how atomicity is achieved. I am using MySQL as the database, and it seems that 2 phase commit is supported by MySQL. The problem with this approach is that XADisk does not seem to be an active project, there is not much documentation about it, and I am not very knowledgeable about the ins and outs of this approach. I am not sure if I should invest in it.
2- I believe I can get away with some of the problems arising from the violation of ACID properties in my application. While uploading images, I can first write the files to disk and, if this operation succeeds, update the paths in the database. If the database transaction fails, I can delete the files from the disk. I know that is still not bulletproof: a power outage might occur just after the db transaction, or the disk might be unresponsive for a while, etc. I know there are also concurrency issues; for instance, if one user tries to modify the uploaded image and another tries to delete it at the same time, there will be problems. Still, the chances of concurrent updates in my application will be relatively low.
I believe I can live with orphan files on the disk or orphan image paths in the db if such exceptional cases occur. If a file path exists in the db and not in the file system, I can show a notification to the user on the report page and he can try to re-upload the image. Orphan files in the file system would not be much of a problem; I might run a process to detect such files from time to time. Still, I am not very comfortable with this approach.
3- The last option might be to not store file paths in the db at all. I can structure the filesystem so that I can infer the file path in code and load all images at once. For instance, I can create a folder named after the report id for each report. When a request is made to load the images of a report, I can load them all at once since I know the report id. That might end up with a huge number of folders in the filesystem, and I am not sure if such a design is acceptable. Concurrency issues will still exist in this scheme.
I would appreciate some advice on which approach I should follow.

I believe you are trying to be ultra-correct, and maybe that much is not needed, but I faced a similar situation some time ago and explored different possibilities. I disliked options along the lines of your option 1, but for options 2 and 3 I had different successful approaches.
Let's first sum up the list of concerns:
- You want the file to be saved
- You want the file path to be linked to the corresponding entity (i.e. the report)
- You don't want a file path to be linked to a file that doesn't exist
- You don't want files in the filesystem that are not linked to any report
And the different approaches:
1. Using DB
You can ensure transactions in the DB with pretty much any relational database, and S3 gives you read-after-write consistency for new objects: if you PUT an object and get a 200 OK, it will be readable. Now, how do you put all this together? You need to keep track of the process. I can think of 2 ways:
1.1 With a progress table
1) The upload request is saved to a table with anything needed to identify this file: report id, temp uploaded file path, destination path, and a status column.
2) You save the file.
3) If the file save fails, you can update the record in the table, or delete it.
4) If saving the file is successful, in a transaction:
   - update the progress table with a successful status
   - update the table where you actually save the report-image relationship
5) Have a cron job that checks the progress table rather than the filesystem. If there is an orphan file in the filesystem, it was definitely added to the table (that was point 1). Here you can decide whether to delete the file or, if you have enough info, to continue the aborted process by triggering point 4.
A variation is to use the report-image relationship table itself, with some extra status columns, as the progress table.
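The progress-table steps can be sketched as follows. This is only an in-memory illustration: the two maps stand in for the progress table and the report-image table, every class, method and status name is made up, and in a real database the two updates in complete() would run inside one transaction.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the progress table; all names are hypothetical. */
class UploadProgress {
    enum Status { PENDING, FAILED, DONE }

    static final class Row {
        final String reportId;
        final String destinationPath;
        Status status = Status.PENDING;

        Row(String reportId, String destinationPath) {
            this.reportId = reportId;
            this.destinationPath = destinationPath;
        }
    }

    private final Map<Long, Row> progressTable = new LinkedHashMap<>();
    private final Map<String, List<String>> reportImages = new LinkedHashMap<>();
    private long nextId = 1;

    /** Point 1: record the request before touching the filesystem. */
    long begin(String reportId, String destinationPath) {
        long id = nextId++;
        progressTable.put(id, new Row(reportId, destinationPath));
        return id;
    }

    /** Point 3: the file save failed. */
    void markFailed(long id) {
        progressTable.get(id).status = Status.FAILED;
    }

    /** Point 4: both updates belong to one transaction in a real database. */
    void complete(long id) {
        Row row = progressTable.get(id);
        row.status = Status.DONE;
        reportImages.computeIfAbsent(row.reportId, k -> new ArrayList<>())
                .add(row.destinationPath);
    }

    /** Point 5: the cron job inspects the table, not the filesystem. */
    List<Long> unfinished() {
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, Row> entry : progressTable.entrySet()) {
            if (entry.getValue().status != Status.DONE) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    List<String> imagesOf(String reportId) {
        return reportImages.getOrDefault(reportId, new ArrayList<>());
    }
}
```

Whatever unfinished() returns is the cron job's work list: each row carries enough information either to delete the stray file or to resume from point 4.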
1.2 With a queue system
For example RabbitMQ, SQS, AMQ, etc.
A very similar approach can be implemented with any queue system instead of a DB table. I won't give many details because it depends on your real infrastructure; this is just the general idea.
1) The upload request goes to a queue: you send a message with anything you may need to identify this file, the report id and, if you want, a tentative final path.
2) You upload the file.
3) A worker reads pending messages from the queue and does the work. The message is marked as consumed only when everything goes well.
4) If something fails, the message naturally comes back to the queue.
5) The next time the message is read, the worker has enough info to see whether there is work to resume, or a file to delete if resuming is not possible.
In both cases, concurrency problems won't be straightforward to manage, but they can be managed (relying on DB locks in the first case and FIFO queues in the second), always with some application logic.
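For the queue variant, the key property is that a message is only removed once the worker succeeds. The sketch below imitates that with an in-memory deque; it is not the API of RabbitMQ, SQS or any other broker, and the class and method names are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

/** Hypothetical in-memory queue: a message leaves only when the handler succeeds. */
class AckQueue {
    private final Deque<String> pending = new ArrayDeque<>();

    void publish(String message) {
        pending.addLast(message);
    }

    /**
     * Hands the oldest message to the handler.
     * Returns true if the handler succeeded and the message was acknowledged.
     */
    boolean processNext(Predicate<String> handler) {
        String message = pending.peekFirst();
        if (message == null) {
            return false; // nothing to do
        }
        boolean ok;
        try {
            ok = handler.test(message);
        } catch (RuntimeException e) {
            ok = false; // a crash in the worker counts as a failure
        }
        if (ok) {
            pending.pollFirst(); // acknowledge: only now is the message removed
        }
        // On failure the message stays at the head and is delivered again.
        return ok;
    }

    int size() {
        return pending.size();
    }
}
```

Because a failed message is redelivered, the handler has to be safe to run twice for the same upload, for example by checking whether the file already exists at the final path.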
2. Without DB
To some extent a system without a database would be perfectly acceptable, if we can defend it as a proper convention over configuration design.
You have to deal with 3 things:
1) Save files
2) Read files
3) Make sure that the structure of the filesystem is manageable
Let's start with 3:
Folder structure
In general, something like one folder per report id will be too simple, too flat, and maybe hard to maintain. This will cause issues: if we have an images folder with one folder per report, and tomorrow you have, let's say, 200k reports, the images folder will have 200k entries, and even an ls will take too much time. The same goes for any programming language trying to access it. That will kill you.
You can think about something more sophisticated. Personally, I like an approach that I learnt from Magento 1 more than 10 years ago and have used a lot since then: a folder structure that starts from fixed external rules and is extended with rules derived from the file name itself.
Say we want to save a product image. The image name is: myproduct.jpg
- The first rule is: for product images I use /media/catalog/product
- Then, to avoid too many images in the same folder, I create one folder for each letter of the image name, up to some number of letters. Let's say 3. So my final path will be something like /media/catalog/product/m/y/p/myproduct.jpg
This way, it is clear where to save any new image. You can do something similar using your report ids, categories, or anything that makes sense for you. The final objective is to avoid a structure that is too flat, and to create a tree that makes sense to you and that can be automated easily.
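A small sketch of that letter-based rule. The class and method names are invented, and skipping non-alphanumeric characters and lower-casing the folder names are my own assumptions rather than anything Magento prescribes.

```java
/** Hypothetical helper that derives a nested folder path from a file name. */
class ShardedPath {

    /**
     * Builds baseDir/<c1>/<c2>/.../fileName using up to `levels` leading
     * letters or digits of the file name (the extension is not used).
     */
    static String forFileName(String baseDir, String fileName, int levels) {
        int end = fileName.lastIndexOf('.');
        if (end < 0) {
            end = fileName.length(); // no extension
        }
        StringBuilder path = new StringBuilder(baseDir);
        int used = 0;
        for (int i = 0; i < end && used < levels; i++) {
            char c = fileName.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                path.append('/').append(Character.toLowerCase(c));
                used++;
            }
        }
        return path.append('/').append(fileName).toString();
    }

    public static void main(String[] args) {
        System.out.println(forFileName("/media/catalog/product", "myproduct.jpg", 3));
    }
}
```

With the inputs from the example above this yields /media/catalog/product/m/y/p/myproduct.jpg; names shorter than the requested depth simply get fewer folder levels.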
And that takes us to the next part:
Read and write.
I implemented a similar system before quite successfully. It allowed me to save files easily, and to retrieve them easily, with locations that were purely dynamic. The parts here were:
- S3 (but you can do it with any filesystem)
- A small microservice acting as a proxy for both read and write
- Some namespace system and attached logic
The logic is quite simple. The namespace lets me know where the file will be saved. For example, the namespace can be companyname/reports/images.
Let's say I develop a microservice for read and write.
For saving a file, it receives:
- namespace
- entity id (i.e. your report)
- file to upload
And it will do the following:
- based on the rules I have for that namespace, plus the id and file name, it saves the file in the corresponding folder
- it doesn't return the physical location. That remains unknown to the client.
Then, for reading, clients will use a URL that uses also convention. For example you can have something like
https://myservice.com/{NAMESPACE}/{entity_id}
And based on the logic, the microservice will know where to find that in the storage and return the image.
If you have more than one image per report, you can do different things, such as:
- you may want to have a third slug in the path, such as https://myservice.com/{NAMESPACE}/{entity_id}/1, https://myservice.com/{NAMESPACE}/{entity_id}/2, etc.
- if it is for your internal application usage, you can have one endpoint that returns the list of all eligible images; let's say https://myservice.com/{NAMESPACE}/{entity_id} returns an array with all image urls
I implemented this with a quite simple YAML config to define the logic, and very simple code reading that config. That gave me a lot of flexibility: for example, reports can be saved in totally different paths, servers or S3 buckets if they belong to different companies or are different report types.
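To make the namespace idea concrete, here is a rough sketch of the lookup such a proxy service might perform. A plain map stands in for the config file, and the storage roots, the path layout and all names are assumptions made for the example, not the layout of the original system.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical resolver: maps namespace + entity id + image index to a storage location. */
class NamespaceResolver {
    // Stand-in for the config file: namespace -> storage root (bucket, server or folder).
    private final Map<String, String> roots = new HashMap<>();

    void register(String namespace, String storageRoot) {
        roots.put(namespace, storageRoot);
    }

    /** Clients only know namespace/entityId/index; the physical location stays internal. */
    String resolve(String namespace, String entityId, int index) {
        String root = roots.get(namespace);
        if (root == null) {
            throw new IllegalArgumentException("Unknown namespace: " + namespace);
        }
        return root + "/" + namespace + "/" + entityId + "/" + index;
    }

    public static void main(String[] args) {
        NamespaceResolver resolver = new NamespaceResolver();
        // The bucket name is purely illustrative.
        resolver.register("companyname/reports/images", "s3://example-bucket");
        System.out.println(resolver.resolve("companyname/reports/images", "42", 1));
    }
}
```

Since clients never see the resolved location, moving a namespace to another bucket or server only requires changing its entry in the config.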

How to process multiple text files at a time for analysis using MapReduce in Hadoop

I have lots of small files, say more than 50,000. I need to process these files together using MapReduce to generate some analysis based on the input files.
Please suggest a way to do this, and also let me know how to merge these small files into one big file using HDFS.

See this blog post from Cloudera explaining the problem with small files.
There is a project on GitHub named FileCrush which merges a large number of small files. From the project's homepage:
Turn many small files into fewer larger ones. Also change from text to sequence and other compression options in one pass.

How to move a large number of files on disk to HDFS sequence files

I want to move a large number of small files to HDFS sequence file(s). I have come across two options:
- Use Flume. Flume does not have a built in file source and this requires a custom source to push the files.
- Use an Apache Camel file to hdfs route.
Even though the above two methods serve the purpose, I would like to weigh other available options before picking one. In particular, I am interested in a solution that is more configurable and results in less code to maintain.

"Use Flume. Flume does not have a built in file source and this requires a custom source to push the files."
Umm... no, that's not right. Flume has a Spooling Directory Source, which does, at a high level, what you want.

Seems like a few lines of code with Camel, e.g. from("file:/..").to("hdfs:..") plus some init and project setup.
Not sure how much easier (fewer lines of code) you can make it with any other method.
If the HDFS options in Camel are enough for configuration and flexibility, then I guess this approach is the best. It should take you just a matter of hours (or even minutes) to have some test cases up and running.

What is better for performance - many files in one directory, or many subdirectories each with one file?

While building web applications, we often have files associated with database entries, e.g. we have a user table and each user has an avatar field, which holds the path to the associated image.
To make sure there are no conflicts in filenames we can either:
- rename files upon upload to ID.jpg; the path would then be /user-avatars/ID.jpg
- or create a sub-directory for each entity and leave the original filename intact; the path would then be /user-avatars/ID/original_filename.jpg
where ID is the user's unique ID number.
Both are perfectly valid from the application logic's point of view.
But which one would be better from a filesystem performance point of view? We have to keep in mind that the number of user entries can be very high (millions).
Is there any limit to the number of sub-directories a directory can hold?

It's going to depend on your file system, but I'm going to assume you're talking about something simple like ext3, and you're not running a distributed file system (some of which are quite good at this). In general, file systems perform poorly over a certain number of entries in a single directory, regardless of whether those entries are directories or files. So whether you're creating one directory per image or putting every image in the root directory, you will run into scaling problems. If you look at this answer:
How many files in a directory is too many (on Windows and Linux)?
You'll see that ext3 runs into limits at about 32K subdirectories in a directory, far fewer than you're proposing.
Off the top of my head, I'd suggest doing some rudimentary sharding into a multilevel directory tree, something like /user-avatars/1/2/12345/original_filename.jpg. (Or something appropriate for your type of ID, but I am interpreting your question to be about numeric IDs.) Doing that will also make your life easier later when you decide you want to distribute across a storage cluster, since you can spread the directories around.
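A rough sketch of that sharding scheme for numeric IDs. The class and method names are invented, and padding single-digit IDs with a leading zero is my own assumption so that every ID gets two shard levels.

```java
/** Hypothetical helper: shards avatar folders by the first two digits of the user ID. */
class AvatarPath {

    /** Returns /user-avatars/<d1>/<d2>/<id>/<fileName> for a non-negative ID. */
    static String forUser(long userId, String fileName) {
        if (userId < 0) {
            throw new IllegalArgumentException("userId must be non-negative");
        }
        String id = Long.toString(userId);
        // Pad single-digit IDs so two shard digits are always available.
        String padded = id.length() < 2 ? "0" + id : id;
        return "/user-avatars/" + padded.charAt(0) + "/" + padded.charAt(1)
                + "/" + id + "/" + fileName;
    }

    public static void main(String[] args) {
        System.out.println(forUser(12345L, "original_filename.jpg"));
    }
}
```

With sequential IDs the leading digits fill unevenly (every ID from 10000 to 19999 lands under /1), so sharding on the last digits, or on a hash of the ID, would likely spread entries more evenly.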

Millions of entries (either files or directories) in one parent directory would be hard to deal with on any filesystem. While modern filesystems use sorting and various tree algorithms to find files quickly, even navigating to the folder with Windows Explorer, Midnight Commander or any other file manager will be slow, as the file manager has to read the contents of the directory. The same applies to file search. So subdirectories are preferred for this.
Yet I should note that access to a particular file is a bit faster when all files are in one directory than when they are separated into subdirectories, at least on NTFS (I measured this myself several times with 400K files).

I've been having a very similar issue with HTML files rather than images, trying to store millions of them on an Ubuntu server with ext4. I ended up running my own benchmarks and found that a flat directory performs way better while being way simpler to use.
Reference: article

If you really want to use files, maybe your best bet is to partition the files off into several subdirectories so that you don't hit a limit. For example, if you have an ID 123456, you can put it in /12/34/56.jpg.
However, I would recommend just using the database to store this data since you are already using one. You can store the image data and ID in the same table, and you don't have to worry about some of the pesky business of dealing with files like making sure the permissions are set right, etc.

Can I read AND write to a db in my main bundle using core data?

With the particular app I am working on, I have a significant amount of data that I need to have in my db so I can read it in. I also have the need to write a few things to the db. I took a copy of the sqlite db out of the documents folder and put it into my main bundle and can read my manually inserted data without problems.
I am now trying to insert data, but I am running into difficulty. I remember reading somewhere that you can't write to a db in the main bundle? Only the documents folder? Is that correct? What are my options if I need to have custom data in a core data db that I also need write to?
Should I move it out of the main into the documents folder?
Thanks!

I can't find documentation to back this up, but it is my understanding that the application bundle is read-only. I have read that if you have a pre-populated Core Data store in the app bundle, you need to copy it to the Documents directory and then make modifications to that copy.
Check out this.
