Write output to different files for different input files using MapReduce

How can I write output to different files for different input files using MapReduce? For example,
suppose I want to calculate the term frequency of terms per file from video.txt and outlier.txt, and store the results in video1.txt and outlier1.txt respectively.

In your mapper, append the filename to each word you find; your key would then be 'word+filename'. Make sure that your partitioner uses the filename part of the key for partitioning, so that all words from the same file end up at the same reducer.
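As a minimal sketch of that idea against the standard Hadoop Java API (the class names and the '|' separator are illustrative choices, not part of the original answer):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit "word|filename" as the key so the filename travels with each term.
public class TermPerFileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Name of the input file this split came from (video.txt, outlier.txt, ...).
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word + "|" + fileName), ONE);
            }
        }
    }
}

// Partitioner: hash only the filename part of the key, so all records from the
// same input file go to the same reducer.
public class FileNamePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String composite = key.toString();
        String fileName = composite.substring(composite.lastIndexOf('|') + 1);
        return (fileName.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Register the partitioner in the driver with job.setPartitionerClass(FileNamePartitioner.class). If the outputs must literally be named video1.txt and outlier1.txt, look at MultipleOutputs in the reducer, which can write to named per-file outputs.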

Related

Camel: aggregate lines and split into files of different sizes

My route reads a file with a number of lines and filters some lines out.
It splits the file on lines, filters them, and aggregates the survivors into a file.
The file URI is in append mode, so each aggregation is appended to it, and a done file is created every time I write to it.
After the file is fully written, another route picks it up.
That route splits the file into n files with an equal number of records. But I am running into an issue where the done file is updated for every aggregation in step 1.
How do I update the done file only when the aggregation is fully done?
I tried to use the property ${exchangeProperty.CamelBatchComplete} in route 1,
but that property is always set to true on aggregation...
It's hard to help with a somewhat confusing description of your use case and no basic code example. However, you can just write the done file yourself when you are done; it's a few lines of Java code.
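For example, a minimal sketch (the helper name and the .done suffix are assumptions for illustration):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DoneFileWriter {
    // Create the done marker exactly once, after the aggregation is known to be
    // complete, instead of relying on the endpoint's doneFileName option, which
    // fires on every append.
    public static void markDone(String outputFile) throws IOException {
        Files.createFile(Paths.get(outputFile + ".done"));
    }
}

You would call this from a processor or bean at the point in the route where the whole aggregation has finished, and drop the doneFileName option from the producing file endpoint.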

Splitting XML files

I am working with an XML file that has unfortunately become large, so now I want to split it into multiple smaller XML files. Is that possible?
For example, if we write a project in C we create multiple C files, but the main function is always in one of them; all other functions or subprograms live in other C files, and whenever we need a function we call it from the file containing main.
Similarly, I would like one main XML file, with all other XML files depending on it.
In simple words, I want to split my large XML file into smaller XML files. I don't have any idea how to do this, so please share an example, or a link to an example, of this kind of thing.
Thanks
If you just want to split the file into smaller parts, you can use the split command in a terminal.
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N use suffixes of length N (default 2)
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes use numeric suffixes instead of alphabetic
-l, --lines=NUMBER put NUMBER lines per output file
--verbose print a diagnostic to standard error just
before each output file is opened
--help display this help and exit
--version output version information and exit
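For example, to cut the file into 5000-line pieces named part_aa, part_ab, and so on (the line count and prefix here are arbitrary):
split -l 5000 large.xml part_
Note that split cuts on raw lines and knows nothing about XML structure, so the pieces will generally not be well-formed XML documents on their own; structure-aware splitting needs an XML-aware tool.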

How can I transform data from (.ddm .pnt .fdt .bin) files to .csv?

I have data stored in .ddm, .pnt, .fdt and .bin files.
How can I export (or extract or transform) data from those file formats into .csv?
I think it's an ADABAS database.
Yes, the file extensions look like an Adabas database.
You need an Adabas/Natural environment to run the database; then you can write a simple program in Natural that reads the database content and writes it to a text "work file" with ";" delimiters and a .csv extension. I don't know of any tool for manually unpacking the database files.
As peterozgood pointed out, you would normally use Natural for that.
If you're using Natural on Windows or Unix, you can code the following:
DEFINE WORK FILE nn TYPE 'CSV'
...where nn is a number between 1 and 32 identifying the desired work file.
(This may also be specified by your admin in the so-called Natparm, along with the codepage and delimiter.)
Then you can output data to the file by coding:
WRITE WORK FILE nn operand1 ... operandN
Natural will automatically create the CSV:
fields are separated by the delimiter and quoted and escaped as necessary.
(The delimiter may be specified in the Natparm or as a startup parameter.)
Unfortunately this functionality is not available with Mainframe Natural.
(The CSV support, that is; work files themselves are of course available.)

How to name a Matlab output file using input from a text file

I am trying to take input from a text file in this format:
Processed_kplr010074716-2009131105131_llc.fits.txt
Processed_kplr010074716-2009166043257_llc.fits.txt
Processed_kplr010074716-2009259160929_llc.fits.txt
etc.... (there are several hundred lines)
and use that input to name my output files in a Matlab loop. Each time the loop ends, I would like it to process the results and save them to a file such as:
Matlab_Processed_kplr010074716-2009131105131_llc.fits.txt
This would make identifying the processed object easier, as I can then just look for the ID number instead of having to sort through a list of randomly named files. I also need to save the plots generated in each loop in a similar fashion.
This is what I have so far:
fid = fopen('file_list_1.txt', 'rt');
inText = textscan(fid, '%s');      % inText{1} is a cell array of file names
fclose(fid);
outText = inText{1};
for j = 1:numel(outText)
    % Do stuff
    save(['Matlab_' outText{j}])                    % entries already end in .txt
    print(Plot, '-djpeg', ['Matlab_' outText{j} '.jpg'])
end
Any help is appreciated, thanks.
If you want the save command to write a text file, you need to use the -ascii flag (optionally with -tabs); see the documentation for more details. You might also want to use dlmwrite instead (or even fprintf, but I don't believe you can write the whole matrix at once with fprintf; you have to loop over the rows).

Concatenate a large number of HDF5 files

I have about 500 HDF5 files, each of about 1.5 GB.
Each file has exactly the same structure: 7 compound (int, double, double) datasets with a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
creates an HDF5 file with the right datasets with unlimited maximum size
opens all the files in sequence
checks the number of samples (as it is variable)
resizes the global file
appends the data
This obviously takes many hours. Would you have a suggestion for improving it?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need an already-written tool.
I found that most of the time was spent in resizing the file, as I was resizing at each step, so I now first go through all my files and get their lengths (they are variable).
Then I create the global h5 file, setting the total length to the sum over all the files.
Only after this phase do I fill the h5 file with the data from all the small files.
Now it takes about 10 seconds per file, so it should take less than 2 hours, while before it was taking much more.
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the HDF5 file paths (you can use bash to do this in one go if there are lots).
In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label").
Concatenate all the labels: label = [label label$i].
Then just write: h5write(data_file_path, "/label", label).
The same can be done if you have groups or more complicated HDF5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia.
Make a text file listing the files to concatenate, in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a Julia script to concatenate the files into one:
# concatenate_HDF5.jl
using HDF5

function concatenate(inputfilepath, outputfilepath)
    data = nothing
    for line in eachline(inputfilepath)
        path = strip(line)
        println(path)
        datai = h5read(path, "/data")
        # In this case concatenating on the 4th dimension
        data = data === nothing ? datai : cat(data, datai; dims=4)
    end
    h5write(outputfilepath, "/data", data)
end

concatenate(ARGS[1], ARGS[2])
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5
