I am writing output from a simulation to a file using the following code
sprintf(filename, "time_data.dat");
FILE *fp = fopen(filename, "w");
for (i = 0; i < ntime; i++) {
    compute_data();
    fprintf(fp, "%d %lf %lf \n", step, time_val, rho_rms);
}
return;
On my desktop, I see the file time_data.dat update every few hours (compute_data() takes a few hundred seconds per time step, with OpenMP on an i7 machine). I have now submitted the job to a cluster node (an E5-2650 processor running Ubuntu Server). I have been waiting for 5 days now, and not a single line has appeared in the file yet. I do
tail -f time_data.dat
to check the output. The simulation will take another couple of weeks to complete; I can't wait that long to see whether my output is good. Is there a way I can get the OS on the node to flush its buffers without disturbing the computation? If I cancel the job now, I am sure there won't be any output.
Please note that the hard disk to which the output file is being written is shared over NFS between multiple nodes and the master node. Is this causing any trouble? Is there some temporary location where the output is actually being written?
PS: I did du -h and the file shows size 0. I also tried ls -l /proc/$ID/fd to confirm the file did open.
You might use lsof or simply ls -l /proc/$(pidof yoursimulation)/fd to check (on the cluster node) that indeed time_data.dat has been opened.
For such long-running programs, I believe it is worthwhile to consider:
application checkpointing techniques
persistence of your application data, e.g. in some database
some way to query your app's state (e.g. an HTTP server library such as libonion, or at least some JSON-RPC or other service to report something about the state)
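As for why nothing appears at all: output to a regular file is fully buffered by stdio, so with one short line every few hundred seconds it can take days before the buffer fills, and NFS adds its own client-side caching on top. For the job that is already running, one somewhat invasive trick is to attach a debugger to the process (gdb -p <pid>) and call fflush(0), which flushes every stdio stream and only pauses the process for a moment. For future runs, a flush per record costs essentially nothing at this output rate. A minimal sketch, with stub stand-ins for the simulation state in the question:

/* Minimal sketch of the same write loop with an explicit flush per record,
 * so each line reaches the kernel (and the NFS client) as soon as it is
 * produced.  The state below is a hypothetical stub standing in for the
 * variables in the question. */
#include <stdio.h>
#include <unistd.h>                       /* only needed for the optional fsync() */

static int step, ntime = 10;              /* placeholders for the real simulation state */
static double time_val, rho_rms;
static void compute_data(void) { step++; time_val += 0.1; rho_rms = 0.5; }

int main(void)
{
    FILE *fp = fopen("time_data.dat", "w");
    if (fp == NULL) { perror("fopen"); return 1; }

    /* alternative: line-buffer the stream instead of flushing by hand */
    /* setvbuf(fp, NULL, _IOLBF, 0); */

    for (int i = 0; i < ntime; i++) {
        compute_data();
        fprintf(fp, "%d %lf %lf\n", step, time_val, rho_rms);
        fflush(fp);               /* push the line out of the stdio buffer right away */
        /* fsync(fileno(fp)); */  /* optionally force it out to the NFS server too */
    }
    fclose(fp);
    return 0;
}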
Related
For years I have been using variations of the du command below to produce a report of the largest files under a specific location, and most of the time it has worked well.
du -L -ch /var/log | sort -rh | head -n 10 &> log-size.txt
This proved to get stuck in several cases, in a way that prevented stopping it even with the timeout -s KILL 5m ... approach.
A few years back this was caused by stalled NFS mounts, but more recently I ran into it on VMs where I don't use NFS at all. Apparently there is a ~1:30 chance of hitting this on OpenStack builds.
I read that following symbolic links (-L) can block du in some cases if there are loops, but my tests failed to reproduce the problem, even when I created a loop.
I cannot avoid following the symlinks because that's how the files are organized.
What would be a safer alternative for generating this report, one that does not block, or at least, if it does, can be constrained to a maximum running duration? It is essential to limit the execution of this command to a number of minutes -- if I can also get a partial result on timeout, or some debugging info, even better.
If you don't care about sparse files and can make do with apparent size (and not the on-disk size), then ls should work just fine: ls -L --sort=s | head -n 10 > log-size.txt
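If the shell pipeline keeps hanging, another option is to do the walk yourself so you control the time limit and can emit a partial result. Below is a rough sketch, not a drop-in replacement for du: it reports apparent sizes, follows symlinks like -L, and simply stops and prints whatever it has after five minutes. Note that a process stuck in uninterruptible sleep on a dead NFS mount may still ignore the alarm, just as it ignores SIGKILL.

/* Sketch: walk a tree (following symlinks), keep the 10 largest regular
 * files, and print a partial report if SIGALRM fires first. */
#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define TOP 10

static volatile sig_atomic_t timed_out = 0;
static void on_alarm(int sig) { (void)sig; timed_out = 1; }

static struct { long long size; char path[4096]; } top[TOP];

/* called by nftw() for every entry; keeps the TOP largest regular files */
static int visit(const char *path, const struct stat *sb, int type, struct FTW *ftwbuf)
{
    (void)ftwbuf;
    if (timed_out)
        return 1;                        /* nonzero return stops the walk early */
    if (type != FTW_F)
        return 0;
    int min = 0;
    for (int i = 1; i < TOP; i++)
        if (top[i].size < top[min].size) min = i;
    if ((long long)sb->st_size > top[min].size) {
        top[min].size = sb->st_size;
        snprintf(top[min].path, sizeof top[min].path, "%s", path);
    }
    return 0;
}

int main(int argc, char **argv)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;            /* no SA_RESTART: let the alarm interrupt syscalls */
    sigaction(SIGALRM, &sa, NULL);
    alarm(5 * 60);                       /* hard limit: five minutes */

    /* flags = 0 (no FTW_PHYS): symbolic links are followed, as with du -L */
    nftw(argc > 1 ? argv[1] : "/var/log", visit, 64, 0);

    if (timed_out)
        fprintf(stderr, "timed out, printing partial result\n");
    for (int i = 0; i < TOP; i++)
        if (top[i].path[0])
            printf("%lld\t%s\n", top[i].size, top[i].path);   /* unsorted; pipe to sort -rn */
    return 0;
}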
I'm writing an MPI application that takes a filename as an argument and tries to read from the file using regular C functions. I run this application on several nodes of a cluster by using qsub, which in turn uses mpiexec.
The application runs just fine on a local node where the file is. For this I just call mpiexec directly:
mpiexec -n 4 ~/my_app ~/input_file.txt
But when I submit it with qsub to be run on other nodes of the cluster, the file-reading part fails. The application errors out at the fopen call -- it can't open the file (most likely because the file is not present on those nodes).
The question is: how do I make the file available to all nodes? I have looked through the qsub man page and couldn't find anything relevant.
I guess Vanilla Gorilla doesn't need an answer any more? However, let's consider the case of a pathological system with no parallel file system and a file system available only at one node. There is a way in ROMIO (a very common MPI-IO implementation) to achieve your goal:
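One way to do this with ROMIO is through its collective-buffering hints: restrict the I/O aggregators to the single host that can actually see the file and disable independent reads, so that no other rank ever opens it. A rough sketch, where the host name node0 and the fixed 1024-byte read are placeholders:

/* Rough sketch of ROMIO's collective-buffering hints.  "node0" stands in for
 * the one host that can actually see the file. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_config_list", "node0:1");   /* the aggregator lives on node0 */
    MPI_Info_set(info, "romio_no_indep_rw", "true");   /* no rank opens the file independently */

    MPI_File fh;
    const char *path = (argc > 1) ? argv[1] : "input_file.txt";
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, info, &fh);

    char buf[1024];
    MPI_Status st;
    /* collective read: the data is read on node0 and shipped to every rank */
    MPI_File_read_at_all(fh, 0, buf, sizeof buf, MPI_CHAR, &st);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}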
How can I transfer a file from one process to all the others with MPI?
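If no shared file system is available at all, the usual fallback is to read the file on one rank and broadcast the bytes to everyone else. A minimal sketch, assuming rank 0 runs on the node that has the file:

/* Sketch: rank 0 reads the file with plain stdio; everyone else receives the
 * bytes via two broadcasts (size first, then contents). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long size = 0;
    char *data = NULL;

    if (rank == 0) {                          /* only rank 0 touches the file */
        FILE *fp = fopen(argv[1], "rb");
        if (fp == NULL) MPI_Abort(MPI_COMM_WORLD, 1);
        fseek(fp, 0, SEEK_END);
        size = ftell(fp);
        rewind(fp);
        data = malloc(size);
        fread(data, 1, size, fp);
        fclose(fp);
    }

    MPI_Bcast(&size, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0)
        data = malloc(size);
    MPI_Bcast(data, (int)size, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* ... every rank can now parse the file contents held in `data` ... */

    free(data);
    MPI_Finalize();
    return 0;
}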
I'm creating scheduling software, and I have hundreds of datasets described in text files.
I'm using dirent.h with a loop to read the text files; for each file I compute a schedule and append the result (CPU time, dataset name, tardiness, ...) to another text file, which is common to all schedules.
I'm opening/closing the result file just once (fopen() before the loop, fclose() after the loop when all the schedules are done).
I have no problem on Windows 7, but under Linux the file seems to be closed by the system after a kind of timeout: only 9-10 datasets get scheduled (about 2 hours), and after that it is stuck because it can't write to the result file :/
Has anyone run into this kind of trouble and found a solution?
Linux does not close the file automatically. Something is wrong in your code.
Try running your program using "strace" and identify where the close() happens.
strace -f -o 1.txt ./my_best_app_ever
Open the 1.txt file using a text editor (or less) and see what your app is doing.
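strace will show the failing system call; it also helps to make failures visible from inside the program, so you get an errno the moment a write stops working. A sketch of the result-writing loop with every write checked (schedule_one() and the file names are hypothetical placeholders, not taken from the question):

/* Sketch: check every write to the shared result file and flush after each
 * schedule, so a failure shows up immediately instead of the loop silently
 * getting stuck. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* stand-in for the real scheduling routine */
static double schedule_one(const char *dataset, double *cpu_time)
{
    (void)dataset;
    *cpu_time = 1.0;
    return 0.0;                               /* pretend tardiness */
}

static int write_results(const char *const *datasets, int n)
{
    FILE *results = fopen("results.txt", "a");
    if (results == NULL) { perror("fopen results.txt"); return -1; }

    for (int i = 0; i < n; i++) {
        double cpu = 0.0;
        double tardiness = schedule_one(datasets[i], &cpu);

        if (fprintf(results, "%s %f %f\n", datasets[i], cpu, tardiness) < 0) {
            fprintf(stderr, "write failed after %d schedules: %s\n", i, strerror(errno));
            break;
        }
        fflush(results);   /* lines written so far survive even if a later schedule hangs */
    }
    fclose(results);
    return 0;
}

int main(void)
{
    const char *datasets[] = { "set01.txt", "set02.txt" };   /* placeholders */
    return write_results(datasets, 2);
}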
I'm writing an inotify watcher in C for a Minecraft server. Basically, it watches server.log, gets the latest line, parses it, and, if it matches a regex, performs some actions.
The program works fine when I test it with "echo string matching the regex >> server.log": it parses the line and does what it should. However, when the string is written to the file by the Minecraft server itself, nothing happens until I shut down the server or (sometimes) log out.
I would post code, but I'm wondering whether it has something to do with ext4 flushing data to disk or something along those lines; a filesystem problem. It would be odd if that were the case, though, because "tail -f server.log" updates whenever the file does.
Solved my own problem. It turned out the server was writing to the log file faster than the watcher could read from it; so the watcher was getting out of sync.
I fixed it by adding a check after it processes the event saying "if the number of lines currently in the log file is more than the recorded length of the log, reprocess the file until the two are equal."
Thanks for your help!
Presumably that is because you are watching for IN_CLOSE events, which may not occur until the server shuts down (and closes the log file handle). See inotify(7) for valid mask parameters for the inotify_add_watch() call. I expect you'll want IN_MODIFY instead.
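A minimal sketch of a watch that fires on every append rather than on close (the log path is a placeholder and error handling is kept short):

/* Minimal sketch: watch server.log for modifications (appends) instead of
 * waiting for the file to be closed. */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    /* IN_MODIFY fires each time the server writes to the log */
    int wd = inotify_add_watch(fd, "server.log", IN_MODIFY);
    if (wd < 0) { perror("inotify_add_watch"); return 1; }

    char buf[4096];
    for (;;) {
        ssize_t len = read(fd, buf, sizeof buf);   /* blocks until an event arrives */
        if (len <= 0) break;
        /* on each event: read the new lines from server.log and parse them */
        printf("server.log changed, %zd bytes of events\n", len);
    }
    close(fd);
    return 0;
}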
Your theory is more than likely correct: the log file is being buffered, and the log writer never flushes that buffer, so everything remains in the buffer until the file is closed or the buffer is full. A fast way to test: start the server up to the point where you know it would have written events to the log, then forcibly kill it so it cannot close the handle; if the log is empty, it is definitely the buffer. If you can get hold of the file handle/descriptor, you can use setbuf to remove buffering, at the cost of performance.
I'm intending to create a programme that can permanently follow a large, dynamic set of log files and copy their entries over to a database for easier near-realtime statistics. The log files are written by diverse daemons and applications, but their format is known, so they can be parsed. Some of the daemons write logs into one file per day, like Apache's cronolog, which creates files like access.20100928. Those files appear with each new day and may disappear when they're gzipped away the next day.
The target platform is an Ubuntu Server, 64 bit.
What would be the best approach to efficiently reading those log files?
I could think of scripting languages like PHP that either open the files themselves and read new data or use system tools like tail -f to follow the logs, or other runtimes like Mono. Bash shell scripts probably aren't well suited for parsing the log lines and inserting them into a database server (MySQL), not to mention easy configuration of my app.
If my programme reads the log files itself, I'd think it should stat() each file once a second or so to get its size and open the file when it has grown. After reading the file (which should hopefully only return complete lines) it could call tell() to get the current position and, next time, seek() directly to the saved position to continue reading. (These are C function names, but actually I wouldn't want to do this in C; Mono/.NET and PHP offer similar functions as well.)
Is that constant stat()ing of the files and subsequent opening and closing a problem? How would tail -f do that? Can I keep the files open and be notified about new data with something like select()? Or does it always return at the end of the file?
In case I'm blocked in some kind of select() or an external tail, I'd need to interrupt that every minute or two to scan for new or deleted files that should (no longer) be followed. Resuming with tail -f is then probably not very reliable; that should work better with my own saved file positions.
Could I use some kind of inotify (file system notification) for that?
If you want to know how tail -f works, why not look at the source? In a nutshell, you don't need to periodically interrupt or constantly stat() to scan for changes to files or directories. That's what inotify does.
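A sketch of how that looks in practice: one inotify instance on the log directory reports new daily files (IN_CREATE / IN_MOVED_TO), removed files (IN_DELETE) and appended data (IN_MODIFY), so there is no periodic stat() loop and no interruption timer. The directory path is a placeholder, and reading from your saved per-file offset is left as a stub:

/* Sketch: watch a log directory for files appearing, disappearing and being
 * appended to.  "/var/log/myapp" is a placeholder. */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return 1; }

    inotify_add_watch(fd, "/var/log/myapp",
                      IN_CREATE | IN_MOVED_TO | IN_DELETE | IN_MODIFY);

    char buf[4096] __attribute__((aligned(8)));
    for (;;) {
        ssize_t len = read(fd, buf, sizeof buf);
        if (len <= 0) break;

        /* the buffer holds one or more variable-length inotify_event records */
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->mask & (IN_CREATE | IN_MOVED_TO))
                printf("start following %s from offset 0\n", ev->name);
            else if (ev->mask & IN_DELETE)
                printf("stop following %s\n", ev->name);
            else if (ev->mask & IN_MODIFY)
                printf("read %s from the saved offset\n", ev->name);
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    close(fd);
    return 0;
}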