Apache Spark: batch processing of files

I have directories and subdirectories set up on HDFS and I'd like to preprocess all the files before loading them into memory at once. I basically have big files (1 MB) that, once processed, will be more like 1 KB, and then I'd do sc.wholeTextFiles to get started with my analysis.
How do I loop over each file (*.xml) in my directories/subdirectories, perform an operation (let's say, for the example's sake, keep the first line), and then write the result back to HDFS (as a new file, say .xmlr)?

I'd recommend you simply use sc.wholeTextFiles and preprocess the files with transformations, then save all of them back as a single compressed sequence file (you can refer to my guide to do so: http://0x0fff.com/spark-hdfs-integration/).
Another option would be to write a MapReduce job that processes a whole file at a time and saves the output to a sequence file, as I proposed before: https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/SmallFilesToSequenceFileConverter.java. It is the example described in the book 'Hadoop: The Definitive Guide'; take a look at it.
In both cases you would do almost the same thing: both Spark and Hadoop bring up a single process (a Spark task or a Hadoop mapper) to handle each file, so both approaches work with the same logic. I'd recommend starting with the Spark one, as it is simpler to implement given that you already have a cluster running Spark.
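To make that first suggestion concrete, here is a rough Java sketch; the HDFS URL, the glob pattern, and the "keep only the first line" step are assumptions for illustration:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class XmlPreprocess {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("xml-preprocess");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read each matching file as a (path, content) pair; the glob is assumed
        // to cover your directory/subdirectory layout.
        JavaPairRDD<String, String> files =
            sc.wholeTextFiles("hdfs://master:54310/data/*/*.xml");

        // Example preprocessing step: keep only the first line of each file.
        JavaPairRDD<Text, Text> reduced = files.mapToPair(kv ->
            new Tuple2<>(new Text(kv._1()), new Text(kv._2().split("\n", 2)[0])));

        // Save everything back as one compressed SequenceFile directory on HDFS.
        reduced.saveAsHadoopFile("hdfs://master:54310/data-preprocessed",
            Text.class, Text.class, SequenceFileOutputFormat.class, GzipCodec.class);

        sc.stop();
    }
}

Writing one SequenceFile directory instead of thousands of tiny .xmlr files also avoids the HDFS small-files problem on later reads.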

Related

TCL Opaque Handle C lib

I am confused by the following code from the Tcl wiki, page 1089:
#define TEMPBUFSIZE 256 /* usually enough space! */
char buf[[TEMPBUFSIZE]];
I was curious and tried to compile the above syntax with gcc and armcc; both fail. I was trying to understand how Tcl's handle-to-file-pointer mechanism works, in order to sort out the chaos in data logging caused by multiple jobs running in the same folder (log files are unique per job).
I have multiple tcl scripts running in parallel as LSF Jobs each using a log file.
For example,
Job1 -> log1.txt
Job2 -> log2.txt
(file writes in both cases are "intermittent" over the entire job execution)
Some of the text that I expect to be part of log1.txt is written to log2.txt, and vice versa, at random. I have tried "fconfigure $fp -buffering none", but the behaviour still persists. One important note: all the LSF jobs are submitted from the same folder, and if I submit the jobs from individual folders, the log files don't contain text written by the other job. I would like the jobs to be executed from the same folder to reduce the space consumed by duplicating the resources in different folders.
Question 1:
Can anyone advise me on how a Tcl "handle" is mapped to a pointer to the memory allocated for the log file? I have mentioned "intermittent" because of the following: "Tcl maps this string internally to an open file pointer when it is time for the interpreter to do some file I/O against that particular file" - wiki 1089
Question 2:
Is there a possibility that two different "open" calls can end up with the same "file"?
Somewhere along the line, the code has been mangled; it looks like it happened when I converted the syntax from one type of highlighting scheme to another in 2011. Oops! My original content used:
char buf[TEMPBUFSIZE];
and that's what you should use. (I've updated the wiki page to fix this.)

Tokyo Cabinet: .tcb.wal file created along with .tcb file; DB size does not decrease when deleting records

I am using Tokyo Cabinet's B+ tree API to create a lookup database. On Linux I see a .tcb.wal file created along with the actual .tcb database file. The size of this file is 0. I wonder whether it's a lock file created to help with synchronization. Also, when I delete records from the database, the size of the file does not decrease. Is there a reason why it behaves like that?
The .wal extension stands for Write Ahead Log. This file is only relevant if you use any of the transaction functions; most applications do not use these. (For details, search for "ahead" in the documentation.)
The file size does not shrink on every deletion, for efficiency reasons. Similarly, if you create an empty database, it will reserve space so that insertions are faster.

Reading a file directly from HDFS into a shell function

I have a shell function that is called from inside my map function. The shell function takes two parameters: an input file and an output file. Something like this:
$> unix-binary /pathin/input.txt /pathout/output.txt
The problem is that these input.txt files reside in HDFS and the output.txt files need to be written back to HDFS. Currently, I first copy the needed file to the local hard drive with fs.copyToLocalFile, call the unix binary, and then write output.txt back to HDFS with fs.copyFromLocalFile.
The problem with this approach is that it is not optimal: it involves a substantial amount of redundant reading from and writing to the local disk, which slows down performance. So my question is: how can I read the HDFS file directly as input and write the results directly to HDFS?
Obviously,
$> unix-binary hdfs://master:54310/pathin/input.txt hdfs://master:54310/pathout/output.txt
will not work. Is there any other way around this? Can I somehow treat an HDFS file as a local file?
I have access to the unix-binary source code, written in C. Maybe changing the source code would help?
thanks
You can add the file to the DistributedCache and access it from the cache in your mapper. Call your shell function on the local file, write the output file to local disk, and then copy the local file to HDFS.
However, operations such as calling shell functions, or reading and writing files from within a mapper/reducer, break the MapReduce paradigm. If you find yourself needing to perform such operations, MapReduce may not be the solution you're looking for. HDFS and MapReduce were designed to perform massive-scale batch processing on small numbers of extremely large files.
Since you have access to the unix-binary source code, your best option might be to implement the particular function(s) you want in Java. Feed the input files to your mapper and call the function from the mapper on the data, rather than working with files on HDFS or the local FS.
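As a sketch of that last suggestion, assuming the input is line-oriented and using a hypothetical transform() method in place of the ported C logic:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that re-implements the unix-binary logic in Java, so the data
// is streamed straight from the HDFS input splits and never copied to local files.
public class TransformMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // transform() stands in for whatever the C binary does to each record;
        // its logic has to be ported to Java for this approach to work.
        String result = transform(line.toString());
        context.write(NullWritable.get(), new Text(result));
    }

    private String transform(String record) {
        return record.trim(); // placeholder for the ported logic
    }
}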

Excel Files, C, and misery

Alright, so, I haven't programmed anything useful in ages - last time I did was a year ago and as you can imagine my knowledge of programming is seriously rusty. (last thing I 'programmed' was a ren'py game over the weekend. One can imagine the limited uses of this. The most advanced C program I wrote was a tic-tac-toe game a year ago. So yeah.)
Anyways, I've been given a job to write a program that takes two Excel files, both of which have a list of items, each associated with an ID. The program needs to search both files for IDs and, where the IDs match, create a new file with the matched IDs and items. This is insanely beyond my limited C capabilities.
If anyone could help, I would seriously appreciate it.
(also, if this is not possible with C, I'll do my best to work with any other languages)
Export the two files to .csv format and write a script to process them. In PHP, for example, you have built-in CSV read/write capabilities.
You can do this with VBA: create a macro in one of the files that iterates over the cells in your column in file 1, compares them to the cells in file 2, and writes them to a new .xls file if they match.
Dana points out that the VLOOKUP function will do this quite easily.
Install GnuWin32
Output the Excel files as text (CSV, for example)
sort each file with the -u option to remove duplicates if needed
mix and sort the 2 files
count unique IDs with uniq -c
filter out lines with a value of 1 for the count with grep
remove the count leaving the ID and whatever else you need with cut
If you know Java then you can use Apache POI for your project. You can use the examples given on the Apache POI website to accomplish your task.
Apache POI Excel Documentation: http://poi.apache.org/spreadsheet/quick-guide.html
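If you go the POI route, a rough sketch of the ID-matching logic might look like the following; the file names, the .xlsx format, and the column layout (ID in column 0, item name in column 1, data on the first sheet) are assumptions about your data:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class MatchIds {
    public static void main(String[] args) throws Exception {
        DataFormatter fmt = new DataFormatter();
        Workbook wb1 = WorkbookFactory.create(new FileInputStream("file1.xlsx"));
        Workbook wb2 = WorkbookFactory.create(new FileInputStream("file2.xlsx"));

        // Index file 1 by ID.
        Map<String, String> itemsInFile1 = new HashMap<>();
        for (Row row : wb1.getSheetAt(0)) {
            Cell id = row.getCell(0);
            if (id != null) {
                itemsInFile1.put(fmt.formatCellValue(id), fmt.formatCellValue(row.getCell(1)));
            }
        }

        // Scan file 2 and write out the IDs that appear in both files.
        Workbook out = new XSSFWorkbook();
        Sheet matches = out.createSheet("matches");
        int outRowNum = 0;
        for (Row row : wb2.getSheetAt(0)) {
            Cell id = row.getCell(0);
            if (id == null) continue;
            String key = fmt.formatCellValue(id);
            if (itemsInFile1.containsKey(key)) {
                Row outRow = matches.createRow(outRowNum++);
                outRow.createCell(0).setCellValue(key);
                outRow.createCell(1).setCellValue(itemsInFile1.get(key));
                outRow.createCell(2).setCellValue(fmt.formatCellValue(row.getCell(1)));
            }
        }

        try (FileOutputStream fos = new FileOutputStream("matched.xlsx")) {
            out.write(fos);
        }
    }
}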
If you absolutely have to do this on xls/xlsx files from a process, you probably need a copy of Excel controlled through COM automation. You can do this in VB/VBA/C#/C++, whatever, some being easier than others. Google for 'Excel automation'.
Rgds,
Martin
Not C, but you may be able to cobble something together very quickly using xlsperl.
It has come in handy for me in the past.

Following multiple log files efficiently

I'm intending to create a program that can continuously follow a large, dynamic set of log files and copy their entries over to a database for easier near-realtime statistics. The log files are written by diverse daemons and applications, but their format is known, so they can be parsed. Some of the daemons write logs into one file per day, like Apache's cronolog, which creates files like access.20100928. Those files appear with each new day and may disappear when they're gzipped away the next day.
The target platform is an Ubuntu Server, 64 bit.
What would be the best approach to efficiently reading those log files?
I could think of scripting languages like PHP that either open the files themselves and read new data, or use system tools like tail -f to follow the logs, or other runtimes like Mono. Bash shell scripts probably aren't well suited for parsing the log lines and inserting them into a database server (MySQL), not to mention easy configuration of my app.
If my program reads the log files itself, I think it should stat() each file once a second or so to get its size and open the file when it has grown. After reading the file (which should hopefully only return complete lines) it could call tell() to get the current position and, next time, seek() directly to the saved position to continue reading. (These are C function names, but I actually wouldn't want to do this in C, and Mono/.NET or PHP offer similar functions as well.)
Is that constant stat()ing of the files and the subsequent opening and closing a problem? How does tail -f do it? Can I keep the files open and be notified about new data with something like select()? Or does it always return at the end of the file?
In case I'm blocked in some kind of select() or an external tail, I'd need to interrupt that every one or two minutes to scan for new or deleted files that should (or should no longer) be followed. Resuming with tail -f would then probably not be very reliable. That should work better with my own saved file positions.
Could I use some kind of inotify (file system notification) for that?
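For reference, a rough Java sketch of the stat()-and-seek polling approach described above (the file name and the one-second interval are assumptions):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PollingFollower {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path log = Paths.get("/var/log/apache2/access.20100928"); // assumed log file
        long lastPos = 0;

        while (true) {
            long size = Files.exists(log) ? Files.size(log) : 0;   // the stat() step
            if (size > lastPos) {
                try (RandomAccessFile raf = new RandomAccessFile(log.toFile(), "r")) {
                    raf.seek(lastPos);                              // resume at the saved position
                    String line;
                    while ((line = raf.readLine()) != null) {
                        System.out.println(line);                   // parse / insert into the DB here
                    }
                    lastPos = raf.getFilePointer();                 // the tell() step
                }
            } else if (size < lastPos) {
                lastPos = 0;                                        // file was truncated or rotated
            }
            Thread.sleep(1000);
        }
    }
}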
If you want to know how tail -f works, why not look at the source? In a nutshell, you don't need to periodically interrupt or constantly stat() to scan for changes to files or directories. That's what inotify does.
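Since the target platform is Ubuntu, here is a minimal sketch of the notification-based alternative using Java's WatchService (backed by inotify on Linux); the directory path is an assumption, and real code would persist the per-file offsets and cope with half-written last lines:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.HashMap;
import java.util.Map;

public class InotifyFollower {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path logDir = Paths.get("/var/log/myapp");     // assumed log directory
        Map<Path, Long> offsets = new HashMap<>();     // last read position per file

        WatchService watcher = FileSystems.getDefault().newWatchService();
        logDir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);

        while (true) {
            WatchKey key = watcher.take();              // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.OVERFLOW) continue;
                Path file = logDir.resolve((Path) event.context());
                if (event.kind() == StandardWatchEventKinds.ENTRY_DELETE) {
                    offsets.remove(file);               // file was rotated or gzipped away
                    continue;
                }
                long from = offsets.getOrDefault(file, 0L);
                try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                    raf.seek(from);
                    String line;
                    while ((line = raf.readLine()) != null) {
                        System.out.println(file.getFileName() + ": " + line); // parse / insert into the DB
                    }
                    offsets.put(file, raf.getFilePointer());
                }
            }
            key.reset();
        }
    }
}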
