Hadoop job taking input files from multiple directories

I have a situation where I have multiple files (100+, 2-3 MB each) in compressed gz format present in multiple directories. For example:
A1/B1/C1/part-0000.gz
A2/B2/C2/part-0000.gz
A1/B1/C1/part-0001.gz
I have to feed all these files into one Map job. From what I see, to use MultipleFileInputFormat all input files need to be in the same directory. Is it possible to pass multiple directories directly to the job?
If not, is it possible to efficiently put these files into one directory without naming conflicts, or to merge them into a single compressed gz file?
Note: I am using plain Java to implement the Mapper and not using Pig or Hadoop streaming.
Any help regarding the above issue will be deeply appreciated.
Thanks,
Ankit

FileInputFormat.addInputPaths() can take a comma-separated list of multiple input paths, like
FileInputFormat.addInputPaths(job, "foo/file1.gz,bar/file2.gz");
where job is the Job (or JobConf) you are configuring.

Related

Moving files with a certain file name from one folder to another

I currently have a folder containing thousands of 10-K financial reports.
All of these files are .txt files.
The files look like this:
20170103_10-K_edgar_data_1662618_0001376474-17-000004_1.txt
20170104_10-Q_edgar_data_886137_0000886137-17-000006_1.txt
20170106_10-K-A_edgar_data_1590496_0001477932-17-000065_1.txt
20170106_10-Q_edgar_data_1275187_0001275187-17-000005_1.txt
I only need files that are "10-K" files and contain a certain CIK-number.
e.g. only the files which contain the string "10-K_edgar_data" and the CIK-number "0001376474"
Is there a way to achieve this for multiple 10-K files with different CIK-numbers?
EDIT: I use macOS High Sierra. I also have Python 2 & 3 and use it with PyCharm. But for now I only know the absolute Python basics (I am still learning the language...)
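A minimal sketch of one way to do this in Python 3; the source/destination folder paths and the CIK list below are only placeholders you would swap for your own:
import os
import shutil

source_dir = "/path/to/reports"        # placeholder: folder holding the .txt filings
target_dir = "/path/to/10-K_filtered"  # placeholder: where the matches should go
cik_numbers = ["0001376474", "0001477932"]  # placeholder CIK numbers you care about

os.makedirs(target_dir, exist_ok=True)
for name in os.listdir(source_dir):
    # keep only plain 10-K filings whose name contains one of the wanted CIK numbers
    if "10-K_edgar_data" in name and any(cik in name for cik in cik_numbers):
        shutil.move(os.path.join(source_dir, name), os.path.join(target_dir, name))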

How would I store different types of data in one file

I need to store data in a file in this format
word, audio, jpeg
How would I store all of that in one file? Is it even possible, or would I need to store links to other data files in place of the audio and jpeg? Would I need a custom file format?
1. Your own filetype
As mentioned by @Ken White, you would need to create your own custom file format for this sort of thing, which would then mean writing your own parser. This could be achieved in almost any language you wanted, but since you are planning on using the Word format, maybe C# would be best for you. However, this technique could be quite complicated and take a relatively large amount of time to thoroughly test your file compressor / decompressor, but it may be best depending on your needs.
2. Command line utilities
Another way to go about this would be to use a script to combine all of the files into one file, and then split it back apart at the other end. For example, the steps could involve:
Combine the files using the Windows copy / Linux cat command on the command line
Create a metadata file of your own that says how many files are in this custom file and how many bytes each one takes up (this could be a short XML or JSON file, for example...)
Use the Linux split command or install a Windows command-line file splitter program (here's just one example) to split the file back into whatever components made it up.
This way you only have to define a really small custom format, and let the OS utilities handle the combining of the files for you.
Example on Windows:
Copy all of the files in your current directory into one output file called 'file.custom'
copy /b * file.custom
Generate your custom metadata file describing what was combined (e.g. each file's name and its size on disk); a short JSON document works well for this.
Use a Windows / Linux command-line tool to split the combined file back into pieces of exactly the lengths (and with exactly the names) recorded in the JSON metadata file. (More info on splitting files in this post.) A rough Python sketch of the whole combine/split approach follows below.
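As a rough illustration of the combine/metadata/split idea above (not a finished tool), here is a small Python sketch that concatenates some files, writes a JSON metadata file with their names and sizes, and later cuts the combined file back apart; all file names are made up:
import json
import os

def combine(file_names, out_path="file.custom", meta_path="file.custom.json"):
    # Concatenate the files and record each one's name and byte length as JSON metadata.
    meta = []
    with open(out_path, "wb") as out:
        for name in file_names:
            with open(name, "rb") as f:
                data = f.read()
            out.write(data)
            meta.append({"name": os.path.basename(name), "size": len(data)})
    with open(meta_path, "w") as m:
        json.dump(meta, m, indent=2)

def split(in_path="file.custom", meta_path="file.custom.json", out_dir="restored"):
    # Read the metadata back and cut the combined file into its original pieces.
    with open(meta_path) as m:
        meta = json.load(m)
    os.makedirs(out_dir, exist_ok=True)
    with open(in_path, "rb") as blob:
        for entry in meta:
            with open(os.path.join(out_dir, entry["name"]), "wb") as out:
                out.write(blob.read(entry["size"]))

combine(["report.docx", "sound.wav", "picture.jpg"])  # made-up input files
split()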
3. ZIP files
You could always store all of the files in a compressed zip archive, and then just use a zip compressor/expander whenever you like to retrieve any of the file formats stored within; a minimal Python sketch is shown after the links below.
I found a couple of examples of this:
Combining multiple files into one ZIP file in only C# .net,
Unzipping ZIP files in C#
Zipping & Unzipping with only windows built-in utilities
Zipping & Unzipping in Linux command line
Good Zipping/Unzipping library in Java
Zipping/Unzipping in Python
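For the ZIP route, a minimal Python sketch using the standard-library zipfile module (the file names are made up):
import zipfile

# Pack the files into one archive...
with zipfile.ZipFile("bundle.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in ["report.docx", "sound.wav", "picture.jpg"]:  # made-up input files
        zf.write(name)

# ...and later pull them back out.
with zipfile.ZipFile("bundle.zip") as zf:
    zf.extractall("restored")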

How do I copy a single PDF file and batch paste it into hundreds of files with a specific naming convention based on a source .csv or Excel file?

I'm trying to copy a single PDF document into several hundred PDF documents, but they have to follow a specific naming convention based on user data stored in a .csv (or text, or Excel; any of these sources is fine). I imagine I could use PowerShell for this, but I don't know how to approach it.
Example:
Take a single PDF - sample.pdf - and rename it to 00003021_20160204.pdf, where '00003021_20160204' is one specific user record in the source .csv. I would then copy sample.pdf and repeat the above process several hundred times until my source list was complete.
Note, these are sequential numbers, so I have to rely on the source data.
I tried searching Stack Overflow and Google, but found nothing. Though I'm open to suggestions on how I (or others) might better search for this solution in the future.
Thanks!
for /F %%p in (files.txt) do copy %1 "%%p.pdf"
where files.txt has a number on each line and the first parameter is the pdf you want to copy.
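If you would rather drive it from the .csv directly, here is a rough Python sketch of the same idea; it assumes a hypothetical names.csv whose first column holds the target file names:
import csv
import shutil

source_pdf = "sample.pdf"  # the PDF to duplicate
with open("names.csv", newline="") as f:  # hypothetical CSV, one record per row
    for row in csv.reader(f):
        if row:  # skip blank lines
            # e.g. a row starting with 00003021_20160204 produces 00003021_20160204.pdf
            shutil.copyfile(source_pdf, row[0] + ".pdf")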

Copying multiple files to a Winscp directory via Script

I have a problem when trying to upload multiple files to one WinSCP directory. I can manage to copy just one single file, but the problem is that I need to upload many files that are generated by a piece of software, and the names are not fixed, so I need to make use of wildcards in order to copy all of them. I have tried many variants of the code, but it was all unsuccessful. The code I am using is:
open "sftp://myserver:MyPass#sfts.us.myserver.com" -hostkey="hostkey"
put "C:\from*.*" "/Myserverfolder/Subfolder/"
exit
This code does actually copy the first alphabetically named file, but it ignores the rest of the files.
Any help with it would be much appreciated
Try this in your script:
lcd C:\from
cd /Myserverfolder/Subfolder
put *
Try doing it all manually first so you can see what's going on.

Storing Large Number Of Files in File-System

I have millions of audio files, generated based on GUIDs (http://en.wikipedia.org/wiki/Globally_Unique_Identifier). How can I store these files in the file system so that I can efficiently add more files to the same file system and search for a particular file efficiently? It should also be scalable in the future.
Files are named based on their GUID (unique file names).
Eg:
[1] 63f4c070-0ab2-102d-adcb-0015f22e2e5c
[2] ba7cd610-f268-102c-b5ac-0013d4a7a2d6
[3] d03cf036-0ab2-102d-adcb-0015f22e2e5c
[4] d3655a36-0ab3-102d-adcb-0015f22e2e5c
Please give your views.
PS: I have already gone through < Storing a large number of images >. I need the particular data structure/algorithm/logic so that it can also be scalable in the future.
EDIT1: Files are around 1-2 million in number and the file system is ext3 (CentOS).
Thanks,
Naveen
That's very easy: build a folder tree based on parts of the GUID values.
For example, make 256 folders, each named after the first byte, and store in each one only the files whose GUID starts with that byte. If that's still too many files in one folder, do the same inside each folder for the second byte of the GUID. Add more levels if needed. Searching for a file will be very fast.
By selecting the number of bytes you use for each level you can effectively choose the tree structure for your scenario.
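A short Python sketch of this layout, assuming two levels keyed on the first two bytes (two hex characters each); the root folder and file names are placeholders:
import os
import shutil

def shard_path(root, guid, levels=2, chars_per_level=2):
    # e.g. "63f4c070-0ab2-..." with the defaults maps to root/63/f4
    parts = [guid[i * chars_per_level:(i + 1) * chars_per_level] for i in range(levels)]
    return os.path.join(root, *parts)

def store(root, src_file, guid):
    # Copy an audio file into its shard directory, creating the directories as needed.
    target_dir = shard_path(root, guid)
    os.makedirs(target_dir, exist_ok=True)
    shutil.copy(src_file, os.path.join(target_dir, guid))

def locate(root, guid):
    # The file's location is fully determined by its GUID, so lookups need no searching.
    return os.path.join(shard_path(root, guid), guid)

# store("/audio", "incoming.wav", "63f4c070-0ab2-102d-adcb-0015f22e2e5c")
# locate("/audio", "63f4c070-0ab2-102d-adcb-0015f22e2e5c")  ->  /audio/63/f4/63f4c070-...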
I would try to keep the number of files in each directory to some manageable number. The easiest way to do this is to name the subdirectory after the first 2-3 characters of the GUID.
Construct an n-level-deep folder hierarchy to store your files. The names of the nested folders will be the first n characters of the corresponding file name. For example, to store the file "63f4c070-0ab2-102d-adcb-0015f22e2e5c" in a four-level-deep folder hierarchy, construct 6/3/f/4 and place the file in this hierarchy. The depth of the hierarchy depends on the maximum number of files you expect in your system. For the few million files in my project, a 4-level-deep hierarchy works well.
I also did the same thing in my project, which has nearly 1 million files. My requirement was also to process the files by traversing this huge list. I constructed a 4-level-deep folder hierarchy and the processing time dropped from nearly 10 minutes to a few seconds.
As an addition to this optimization: if you want to process all the files present in these deep folder hierarchies, then instead of calling a function to list the first 4 levels, just precompute all the possible 4-level-deep folder names. Since a GUID uses 16 possible characters, there will be 16 folders at each of the first four levels, so we can precompute the 16*16*16*16 folder paths, which takes just a few ms (see the small sketch below). This saves a lot of time if this large number of files is stored at a shared location and listing a single directory takes nearly a second.
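For example, a tiny Python sketch of that precomputation:
import itertools

hex_chars = "0123456789abcdef"
# Precompute every possible 4-level-deep path: 16 * 16 * 16 * 16 = 65536 of them.
all_paths = ["/".join(p) for p in itertools.product(hex_chars, repeat=4)]
print(len(all_paths))   # 65536
print(all_paths[:3])    # ['0/0/0/0', '0/0/0/1', '0/0/0/2']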
Sorting the audio files into separate subdirectories may be slower if dir_index is used on the ext3 volume. (dir_index: "Use hashed b-trees to speed up lookups in large directories.")
This command will set the dir_index feature: tune2fs -O dir_index /dev/sda1
