Camel - File Reading 10 lines batch - apache-camel

We are trying to read a very large CSV file (which cannot be fully loaded into memory) in batches (say, 100 lines per batch) using Apache Camel. Any assistance would be greatly appreciated.

Use the Splitter EIP in streaming mode: http://camel.apache.org/splitter
Read that page and see the section about grouping N lines together. This allows you to read and process the file 100 lines at a time.

You can also use the Throttler EIP to limit the number of files loaded at a time.

Use split with groups, e.g.:
from(CSV).split().tokenize("\n", 100).streaming()
where each Exchange body will be a String containing a group of up to 100 lines.
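Outside Camel, the same idea (stream the file lazily and hand out fixed-size groups of lines, never holding the whole file in memory) can be sketched in plain Python. This is an illustration of the grouping technique, not Camel itself; the function name and batch size are just examples:

```python
from itertools import islice

def batches(path, size=100):
    """Yield lists of up to `size` lines, reading the file lazily."""
    with open(path) as f:
        while True:
            batch = list(islice(f, size))  # pulls at most `size` lines from the stream
            if not batch:                  # empty batch means end of file
                break
            yield batch

# Usage: process each group without ever loading the whole file.
# for group in batches("big.csv"):
#     handle(group)
```

Camel's streaming splitter with a group size does the equivalent internally, handing each group to your processor as one Exchange.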

Related

Fast way to get information from a huge logfile on unix

I have a 6 GB application log file. The log lines have the following format (shortened):
[...]
timestamp;hostname;sessionid-ABC;type=m
timestamp;hostname;sessionid-ABC;set_to_TRUE
[...]
timestamp;hostname;sessionid-HHH;type=m
timestamp;hostname;sessionid-HHH;set_to_FALSE
[...]
timestamp;hostname;sessionid-ZZZ;type=m
timestamp;hostname;sessionid-ZZZ;set_to_FALSE
[...]
timestamp;hostname;sessionid-WWW;type=s
timestamp;hostname;sessionid-WWW;set_to_TRUE
Many sessions have more than these two lines.
I need to find all sessions with type=m and set_to_TRUE.
My first attempt was to grep all session IDs with type=m and write them to a file, then loop over every line of that file (one session ID per line) and grep the big log file for sessionID;set_to_TRUE.
This method takes a very long time. Can anyone give me a hint on how to solve this in a much better and faster way?
Thanks a lot!
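A much faster alternative to repeated grepping is a single pass over the file: remember which sessions saw type=m, and flag those that later see set_to_TRUE. A Python sketch, with field positions taken from the sample lines above and assuming (as in the sample) that a session's type=m line precedes its set_to_* line:

```python
def sessions_m_and_true(path):
    """One pass over the log: return sessions that have type=m and set_to_TRUE."""
    type_m = set()
    matched = set()
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split(";")
            if len(parts) < 4:
                continue                      # skip malformed lines
            session, field = parts[2], parts[3]
            if field == "type=m":
                type_m.add(session)
            elif field == "set_to_TRUE" and session in type_m:
                matched.add(session)
    return matched
```

Reading 6 GB once with a set lookup per line is linear in the file size, versus the original approach's one full scan per session ID. The same logic is a one-liner in awk.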

camel aggregate lines and split into files of different sizes

My route reads a file with a number of lines and filters some lines out.
It splits the file on lines, filters, and aggregates into a file.
The file URI is in append mode, so each aggregation is appended to it, and a done file is created every time I write to it.
After the file is fully written, another route picks it up.
This route splits the file into n files with an equal number of records. But I am running into an issue: the done file is updated for every aggregation in step 1.
How do I update the done file only when the aggregation is fully done?
I tried to use the property ${exchangeProperty.CamelBatchComplete} in route 1.
But that property is always set to true on aggregation...
It is hard to help with such a confusing description of your use case and no basic code example. However, you can simply write the done file yourself when you are done; it's a few lines of Java code.
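The idea the answer suggests is to take the done-file creation away from the file endpoint and do it yourself, once, after the last append. A minimal sketch of that pattern (plain Python standing in for the few lines of Java; file names and the ".done" suffix are just illustrative conventions):

```python
from pathlib import Path

def write_with_done(data_file, batches):
    """Append all aggregated batches first, then create the done marker once."""
    with open(data_file, "a") as f:
        for batch in batches:
            f.write(batch)
    # Only now signal completion -- the consuming route should
    # be configured to wait for this marker, not for the data file.
    Path(data_file + ".done").touch()
```

In Camel terms: disable the automatic done file on the producing endpoint, and create the marker in a final step (e.g. an onCompletion or a last processor) after the aggregation truly finishes.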

Apache Spark: batch processing of files

I have directories and subdirectories set up on HDFS, and I'd like to preprocess all the files before loading them into memory at once. I basically have big files (1 MB) that, once processed, will be more like 1 KB; then I'd do sc.wholeTextFiles to get started with my analysis.
How do I loop over each file (*.xml) in my directories/subdirectories, do an operation (let's say, for the example's sake, keep the first line), and then dump the result back to HDFS (a new file, say .xmlr)?
I'd recommend just using sc.wholeTextFiles and preprocessing the files with transformations, then saving them all back as a single compressed sequence file (you can refer to my guide to do so: http://0x0fff.com/spark-hdfs-integration/).
Another option is to write a MapReduce job that processes a whole file at a time and saves the results to a sequence file, as proposed above: https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/SmallFilesToSequenceFileConverter.java. It is the example described in the book 'Hadoop: The Definitive Guide'; take a look at it.
In both cases you would do almost the same thing: both Spark and Hadoop bring up a single process (a Spark task or a Hadoop mapper) per file, so in general both approaches work using the same logic. I'd recommend starting with the Spark one, as it is simpler to implement given that you already have a cluster running Spark.
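The per-file operation itself is simple either way. Here is a local-filesystem sketch in Python of the loop the question describes (keep only the first line of each *.xml and write a sibling .xmlr); on HDFS you would do the same transformation inside a Spark map over wholeTextFiles, or via an HDFS client library, neither of which is shown here:

```python
from pathlib import Path

def preprocess_tree(root):
    """Walk root recursively; for each .xml, keep the first line in a .xmlr file."""
    for xml in Path(root).rglob("*.xml"):
        with open(xml) as f:
            first = f.readline()              # the example operation: first line only
        xml.with_suffix(".xmlr").write_text(first)
```

The point of doing this before sc.wholeTextFiles is that the shrunken 1 KB files are far cheaper to hold in memory as (filename, content) pairs.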

identifying data file type

I have a huge 1.9 GB data file without an extension that I need to open and get some data from. The problem is that this data file is extension-less; I need to know what extension it should have and what software I can open it with to view the data in a table.
Here is a picture (not reproduced here).
It's a file of only 2 lines. I already tried CSV in Excel, but it did not work. Any help?
I have never used it, but you could try this:
http://mark0.net/soft-tridnet-e.html
explained here:
http://www.labnol.org/software/unknown-file-extensions/20568/
The third "column" of that line looks 99% likely to be output from PHP's print_r function (with newlines imploded so it can be stored on a single line).
There may not be a "format" or program to open it with if it's just some app's custom debug/output log.
A quick Google search finds a few programs that split large files into smaller units. That may make it easier to load into something (may or may not be Notepad++) for reading.
It shouldn't be too hard to write a script that reads the lines and reconstitutes the session "array" into a more readable format (read: vertical, not inline), but it would have to be a one-off custom job, since no one other than the holder of your file would have a use for it.
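Before reaching for a dedicated tool, a first pass is to inspect the file's leading bytes yourself; many binary formats start with a fixed signature. A small Python sketch (the signature table here is a tiny illustrative subset; tools like TrID check thousands of patterns):

```python
# A few well-known magic-byte signatures (illustrative subset only).
SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip (also docx/xlsx/jar containers)",
    b"\x89PNG": "png",
    b"SQLite format 3\x00": "sqlite database",
}

def sniff(path):
    """Return a guess at the file type from its leading bytes."""
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown (possibly plain text or a custom log)"
```

If the result is "unknown" and the bytes are printable, the file is most likely a text log like the one described above, and a custom parsing script is the way to go.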

Concatenate a large number of HDF5 files

I have about 500 HDF5 files, each about 1.5 GB.
Each file has exactly the same structure: 7 compound (int, double, double) datasets and a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
- creates an HDF5 file with the right datasets with unlimited max size
- opens all the files in sequence
- checks the number of samples (as it is variable)
- resizes the global file
- appends the data
This obviously takes many hours; would you have a suggestion for improving it?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need an already-written tool.
I found that most of the time was spent resizing the file, since I was resizing at each step, so I now first go through all my files and get their lengths (which vary).
Then I create the global h5 file, setting the total length to the sum over all the files.
Only after this phase do I fill the h5 file with the data from all the small files.
Now it takes about 10 seconds per file, so the whole job should take less than 2 hours, whereas before it was taking much longer.
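The speedup comes from a classic two-pass pattern: measure first, allocate the destination once at full size, then fill, instead of growing it at every step. Sketched in plain Python, with lists standing in for the HDF5 datasets (the real script would pass the precomputed total shape to h5py's create_dataset and assign into slices the same way):

```python
def concatenate(sources):
    """Two-pass concatenation: sources is a list of lists
    (stand-ins for the variable-length per-file datasets)."""
    # Pass 1: sum the variable lengths.
    total = sum(len(s) for s in sources)
    # Allocate the destination once, at its final size.
    out = [None] * total
    # Pass 2: fill by slice assignment, with no resizing.
    pos = 0
    for s in sources:
        out[pos:pos + len(s)] = s
        pos += len(s)
    return out
```

With HDF5 this matters even more than with lists: each resize of a chunked dataset can touch metadata and reallocate storage, so doing it 500 times per dataset dominated the runtime.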
I get that answering this earns me a necro badge, but things have improved for me in this area recently.
In Julia this takes a few seconds.
1. Create a txt file that lists all the HDF5 file paths (you can use bash to do this in one go if there are many).
2. In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label").
3. Concatenate all the labels: label = [label label$i]
4. Then just write: h5write(data_file_path, "/label", label)
The same can be done if you have groups or more complicated HDF5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make a text file listing the files to concatenate, in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a Julia script to concatenate the files into one:
# concatenate_HDF5.jl
using HDF5
inputfilepath = ARGS[1]
outputfilepath = ARGS[2]
f = open(inputfilepath)
firstit = true
data = []
for line in eachline(f)
    global data, firstit
    r = strip(line)              # eachline already drops the trailing newline in recent Julia
    println(r)
    datai = h5read(r, "/data")
    if firstit
        data = datai
        firstit = false
    else
        data = cat(data, datai; dims=4)  # in this case, concatenating on the 4th dimension
    end
end
close(f)
h5write(outputfilepath, "/data", data)
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5
