Spark alternatives for reading incomplete files - file

I am using Spark 2.3 to read Parquet files. The files average about 20mb each and currently have about 20,000 of them. The file directories are partitioned down to the day level across 24 months.
The issue I am facing is that on an occasional basis most of the files are rewritten by an Hbase application. This does not occur every day, but when it does it takes several days for that process to complete for all files. My process needs to look at the data in ALL files each day to scan for changes based on an update date stored in each record. But I am finding if I read the HDFS dir during the period of the mass file rewrite, my Spark program fails and I get errors like these:
java.io.EOFException: Cannot seek after EOF
OR
java.io.IOException: Could not read footer for file: FileStatus{path=hdfs://project/data/files/year=2020/month=02/day=10/file_20200210_11.parquet
OR
Caused by: java.lang.RuntimeException: hdfs://project/data/files/year=2020/month=02/day=10/file_20200210_11.parquet is not a Parquet file. expected magic number at tail
I assume the errors are because it is trying to read a file that is not finalized (still being written to). But from Spark's perspective it looks like a corrupt file. So my process would fail and not be able to read the files for a couple days until all files are stable.
Code is pretty basic:
val readFilePath="/project/data/files/"
val df = spark.read
.option("spark.sql.parquet.mergeSchema", "true")
.parquet(readFilePath)
.select(
$"sourcetimestamp",
$"url",
$"recordtype",
$"updatetimestamp"
)
.withColumn("end_date", to_date(date_format(lit("2021-08-23"), "yyyy-MM-dd")))
.withColumn("start_date", date_sub($"end_date", 2))
.withColumn("update_date", to_date($"updatetimestamp","yyyy-MM-dd"))
.filter( $"update_date" >= $"start_date" and $"update_date" <= $"end_date" )
Is there anything I can do programmatically to work around a problem like this? It doesn't seem like I can trap an error like that and make it continue. In Spark version 3 there is an option "spark.sql.files.ignoreCorruptFiles", that I think would help, but not in version 2.3 so that doesn't help me.
I considered reading a single file at a time and loop through all files, but that would take forever (about 12 hours based on a test of just a single month). Outside of asking the application owner to write changed files to a temp dir then move and replace them as each is completed into the main dir, I don't see any other alternatives so far. Not even sure if that would help or if I would run into collisions during the small period of time the file was being moved.

Related

TCL Opaque Handle C lib

I am confused with the following code from tcl wiki 1089,
#define TEMPBUFSIZE 256 /* usually enough space! */
char buf[[TEMPBUFSIZE]];
I was curious and I tried to compile the above syntax in gcc & armcc, both fails. I was looking to understand how tcl handle to file pointer mechanism works to solve the chaos on data logging by multiple jobs running in a same folder [log files unique to jobs].
I have multiple tcl scripts running in parallel as LSF Jobs each using a log file.
For example,
Job1 -> log1.txt
Job2 -> log2.txt
(file write in both case is "intermittent" over the entire job execution)
Some of the text which I expect to be part of log1.txt is written to log2.txt and vice versa randomly. I have tried with "fconfigure $fp -buffering none", the behaviour still persists. One important note, all the LSF jobs are submitted from the same folder and if I submit the jobs from individual folder, the log files dont have text write from other job. I would like the jobs to be executed from same folder to reduce space consumption from repeating the resource in different folder.
Question1:
Can anyone advice me how the tcl "handles" is interpreted to a pointer to the memory allocated for the log file? I have mentioned intermitent because of the following, "Tcl maps this string internally to an open file pointer when it is time for the interpreter to do some file I/O against that particular file - wiki 1089"
Question2:
Is there a possiblity that two different "open" can end up having same "file"?
Somewhere along the line, the code has been mangled; it looks like it happened when I converted the syntax from one type of highlighting scheme to another in 2011. Oops! My original content used:
char buf[TEMPBUFSIZE];
and that's what you should use. (I've updated the wiki page to fix this.)

Python reading of files

I am new with Python and I am facing my first troubles.
I have to read some .dat files (100), and each file contains a set of 5000 power traces. The total amount of memory taken by the files is almost 10 GB, so I cannot read the files all toghether because I fill the RAM. So, the np.fromfile method with a for loop in every files is not usefull.
I would like to make a memory mapping, reading just few files at time, but I need to handle the data at the same time.
Do you have some suggestion?
Cheers

Apache Spark: batch processing of files

I have directories, sub directories setup on HDFS and I'd like to pre process all the files before loading them all at once into memory. I basically have big files (1MB) that once processed will be more like 1KB, and then do sc.wholeTextFiles to get started with my analysis
How do I loop on each file (*.xml) on my directories/subdirectories, do an operation (let's say for the example's sake, keep the first line), and then dump the result back to HDFS (new file, say .xmlr) ?
I'd recommend you just to use sc.wholeTextFiles and preprocess them using transformations, after that save all of them back as a single compressed sequence file (you can refer to my guide to do so: http://0x0fff.com/spark-hdfs-integration/)
Another option might be to write a mapreduce that would process the whole file at a time and save them to the sequence file as I proposed before: https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/SmallFilesToSequenceFileConverter.java. It is the example described in 'Hadoop: The Definitive Guide' book, take a look at it
In both cases you would do almost the same, both Spark and Hadoop will bring up a single process (Spark task or Hadoop mapper) to process these files, so in general both of the approaches will work using the same logic. I'd recommend you to start with a Spark one as it is simpler to implement given the fact you already have a cluster with Spark

Following multiple log files efficiently

I'm intending to create a programme that can permanently follow a large dynamic set of log files to copy their entries over to a database for easier near-realtime statistics. The log files are written by diverse daemons and applications, but the format of them is known so they can be parsed. Some of the daemons write logs into one file per day, like Apache's cronolog that creates files like access.20100928. Those files appear with each new day and may disappear when they're gzipped away the next day.
The target platform is an Ubuntu Server, 64 bit.
What would be the best approach to efficiently reading those log files?
I could think of scripting languages like PHP that either open the files theirselves and read new data or use system tools like tail -f to follow the logs, or other runtimes like Mono. Bash shell scripts probably aren't so well suited for parsing the log lines and inserting them to a database server (MySQL), not to mention an easy configuration of my app.
If my programme will read the log files, I'd think it should stat() the file once in a second or so to get its size and open the file when it's grown. After reading the file (which should hopefully only return complete lines) it could call tell() to get the current position and next time directly seek() to the saved position to continue reading. (These are C function names, but actually I wouldn't want to do that in C. And Mono/.NET or PHP offer similar functions as well.)
Is that constant stat()ing of the files and subsequent opening and closing a problem? How would tail -f do that? Can I keep the files open and be notified about new data with something like select()? Or does it always return at the end of the file?
In case I'm blocked in some kind of select() or external tail, I'd need to interrupt that every 1, 2 minutes to scan for new or deleted files that shall (no longer) be followed. Resuming with tail -f then is probably not very reliable. That should work better with my own saved file positions.
Could I use some kind of inotify (file system notification) for that?
If you want to know how tail -f works, why not look at the source? In a nutshell, you don't need to periodically interrupt or constantly stat() to scan for changes to files or directories. That's what inotify does.

A simple log file format

I'm not sure if it was asked, but I couldn't find anything like this.
My program uses a simple .txt file for log purposes, It just creates/opens a file and appends lines.
After some time, I started to log quite a lot of activities, so the file became too large and hardly readable. I know, that it's not write way to do this, but I simply need to have a readable file.
So I thought maybe there's a simple file format for log files and a soft to view it or if you'd have any other suggestions on this question?
Thanks for help in advance.
UPDATE:
It's access 97 application. I'm logging some activities like Form Loading, SELECT/INSERT/UPDATE to MS SQL Server ...
The log file isn't really big, I just write the duration of operations, so I need a simple way to do this.
The Log file is created on a user's machine. It's used for monitoring purposes logging some activities' durations.
Is there a way of viewing that kind of simple Log file highlighted with an existing tool?
Simply, I'd like to:
1) Write smth like "'CurrentTime' 'ActivityName' 'Duration in milliseconds' " (maybe some additional information like query string) into a file.
2) Open it with a tool and view it highlighted or somehow more readable.
ANSWER: I've found a nice tool to do all I've wanted. Check my answer.
LogExpert
The 3 W's :
When, what and where.
For viewing something like multitail ("tail on steroids") http://www.vanheusden.com/multitail/
or for pure ms windows try mtail http://ophilipp.free.fr/op_tail.htm
And to keep your files readable, you might want to start new files when if the filesize of the current log file is over certain limit. Example:
activity0.log (1 mb)
activity1.log (1 mb)
activity2.log (1 mb)
activity3.log (1 mb)
activity4.log (205 bytes)
A fairly standard way to deal with logging from an application into a plain text file is to:
split the logs into different program functional areas.
rotate the logs on a daily/weekly basis (i.e. split the log on a size or date basis)
so the current log is "mylog.log" or whatever, and yesterday's was "mylog.log.1" or "mylog.ddmmyyyy.log"
This keeps the size of the active log manageable. And then you can just have expiry rules so that old logs get thrown away on a regular basis.
In addition it would be a good idea to have different log levels for your application (info/warning/error/fatal) so that you're not logging more than is necessary.
First, check you're only logging things that are useful.
If it's all useful, make sure it is easily parsable by tools such as grep, that way you can find the info you want. Make sure you have the type of log entry, the date/time all conforming to a layout.
Build yourself a few scripts to extract the information for you.
Alternatively, use separate log files for different types of entries.
Basically you better just split logs according to severity. You'll rarely need to read all logs for the whole system. For example apache allows to configure error log and access log, pretty obvious what info exactly they have.
If you're under linux system grep is your best tool to search through logs for specific entries.
Look at popular logfiles like /var/log/syslog on Unix to get ideas:
MMM DD HH:MM:SS hostname process[pid]: message
Example out of my syslog:
May 11 12:58:39 raphaelm anacron[1086]: Normal exit (1 job run)
But to give you the perfect answer we'd need more information about what you are logging, how much and how you want to read the logs.
If only the size of the log file is the problem, I recommend using logrotate or something similar. logrotate watches log files and, depending on how you configured it, after a given time or when the log file exceeds a given size, it moves the log file to an archive directory and optionally compresses it. Then the original log file is truncated. For example, you could configure it to archive the log file every 24 hours or whenever the files size exceeds 500kb.
If this is a program, you might investigate apache logging libraries (http://logging.apache.org/) Out of the box, they'll give you a decent logging format out of the box. They're also customizable, so you can simplify your parsing job.
If this is a script, see some of the other answers.
LogExpert
I've found it here. Filter is better, than in mtail. There's an option of highlighting just adding a string and the app is nice and readable. You can customize columns as you like.

Resources