Potential Dangers of Running Code in Parallel - c

I am working on OS X and using bash for my shell. I have a script which calls an executable hundreds of times, and each call is independent of the others, so I am going to run this code in parallel. However, each call to the executable appends its output to a shared text file on a new line.
The ordering of lines in the text file is not important (it would be nice, but not worth over-complicating things, since I can just use the Unix sort command afterwards). What does matter is that every call of the executable actually gets its line into the file. My concern is that if I run the script in parallel, by some freak accident two processes will check out the text file, print to it, and then save different copies back to the original directory, nullifying one of the writes to the file.
Does this actually happen, or is my understanding of printing to a file flawed? I don't know whether this is decided on a case-by-case basis, so I will provide some mock code of what my program does below.
Script:
#!/bin/sh
abs=$1
input=$(echo "$abs" | awk '{print 0.004 + 0.005*$1 }')
./program "$input"
"./program":
~~Normal .c file stuff here~~
~~VALUE magically calculated here~~
~~run number is pulled out of input and assigned to index for sorting~~
FILE *fpp;
fpp = fopen("Doc.txt","a");
fprintf(fpp,"%d, %.3f\n", index, VALUE);
fclose(fpp);
~~Closing events of program.c~~
Commands to run script in parallel in bash:
printf "%s\n" {0..199} | xargs -P 8 -n 1 ./program
Thanks for any help you guys can offer.

A write() call (which is what fwrite() and fprintf() ultimately use) on a file opened with the append flag in open() (which fopen() with mode "a" does for you) is guaranteed to avoid the race condition you describe, provided each line is emitted in a single write().
O_APPEND
If set, the file offset shall be set to the end of the file prior to each write.
From the POSIX specification for open(): opengroup.org
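For the asker's case this means the fopen(..., "a") / fprintf / fclose sequence is already safe. A minimal sketch using the raw POSIX calls directly (append_line is only an illustrative name, not part of the question's program):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Minimal sketch: append one line with O_APPEND.
 * The kernel moves the file offset to end-of-file as part of each write(),
 * so concurrent writers cannot clobber each other's lines, provided each
 * line goes out in a single write() call. */
int append_line(const char *path, int index, double value)
{
    char line[64];
    int len = snprintf(line, sizeof(line), "%d, %.3f\n", index, value);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, line, (size_t)len);
    close(fd);
    return written == len ? 0 : -1;
}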

Race conditions are what you are thinking of.
I'm not 100% sure, but if you simply append to the end of the file rather than opening it, editing it, and writing it back, you should be all right.

If you have the option, make your program write to standard output instead of directly to a file. Then you can let parallel collect your programs' output and redirect it all into a single file:
printf "%s\n" {0..199} | parallel -P 8 -n 1 ./program > merged_output.txt

Yeah, that looks like a recipe for disaster. If those processes both hit the file at roughly the same time, only one write will "take".
I suggest either (easier) writing to separate files then catting them together when the processing is done, or (harder) sending all results to a consumer process that will write the file for everyone.
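For the "separate files" route, the change to the question's C program is small. A rough sketch that replaces the fopen/fprintf/fclose block shown in the question, reusing its index and VALUE and a hypothetical Doc_NNN.txt naming scheme (afterwards the pieces can be joined, e.g. with cat Doc_*.txt > Doc.txt):

/* Sketch: each run writes its own file keyed by index, so no two
 * parallel processes ever touch the same output file. */
char name[64];
snprintf(name, sizeof(name), "Doc_%03d.txt", index);
FILE *fpp = fopen(name, "w");
if (fpp != NULL) {
    fprintf(fpp, "%d, %.3f\n", index, VALUE);
    fclose(fpp);
}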

Related

Easiest way to overwrite a series of files with zeros

I'm on Linux. I have a list of files and I'd like to overwrite them with zeros and remove them. I tried using
srm file1 file2 file3 ...
but it's too slow (I have to overwrite and remove ~50 GB of data) and I don't need that kind of security (I know that srm does a lot of passes instead of a single pass with zeros).
I know I could overwrite every single file using the command
cat /dev/zero > file1
and then remove it with rm, but I can't do that manually for every single file.
Is there a command like srm that does a single pass of zeros, or maybe a script that can do cat /dev/zero on a list of files instead of on a single one? Thank you.
Something like this, using stat to get the correct size to write, and dd to overwrite the file, might be what you need:
for f in $(<list_of_files.txt)
do
    read blocks blocksize < <(stat -c "%b %B" "${f}")
    dd if=/dev/zero bs="${blocksize}" count="${blocks}" of="${f}" conv=notrunc
    rm "${f}"
done
Use /dev/urandom instead of /dev/zero for (slightly) better erasure semantics.
Edit: added conv=notrunc option to dd invocation to avoid truncating the file when it's opened for writing, which would cause the associated storage to be released before it's overwritten.
I use shred for doing this.
The following are the options that I generally use.
shred -n 3 -z <filename> - This will make 3 passes overwriting the file with random data, then a final pass overwriting it with zeros. The file will remain on disk, but it will contain only zeros.
shred -n 3 -z -u <filename> - Similar to above, but also unlinks (i.e. deletes) the file. The default option for deleting is wipesync, which is the most secure but also the slowest. Check the man pages for more options.
Note: -n is used here to control the number of passes of random data. Increasing this number makes the shred operation take longer to complete but shreds more thoroughly. I think 3 is enough, but I may be wrong.
The purpose of srm is to destroy the data in the file before releasing its blocks.
cat /dev/null > file is not at all equivalent to srm because
it does not destroy the data in the file: the blocks will be released with the original data intact.
Using /dev/zero instead of /dev/null does not even work because /dev/zero never ends.
Redirecting the output of a program to the file will never work for the same reason given for cat /dev/null.
You need a special-purpose program that opens the given file for writing, writes zeros over all bytes of the file, and then removes the file. That's what srm does.
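For illustration, a single-pass version of that idea might look like the sketch below. This is not srm itself; zero_and_remove is a made-up name, error handling is minimal, and sparse files and symlinks are ignored:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Simplified sketch of a single-pass zero wipe: overwrite the file's
 * bytes in place (no truncation), flush to disk, then unlink it. */
int zero_and_remove(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;

    int fd = open(path, O_WRONLY);        /* no O_TRUNC: keep the blocks */
    if (fd < 0)
        return -1;

    static char buf[1 << 16];              /* 64 KiB of zeros */
    off_t remaining = st.st_size;
    while (remaining > 0) {
        size_t chunk = remaining < (off_t)sizeof(buf) ? (size_t)remaining : sizeof(buf);
        if (write(fd, buf, chunk) != (ssize_t)chunk) {
            close(fd);
            return -1;
        }
        remaining -= chunk;
    }

    fsync(fd);                              /* make sure the zeros hit the disk */
    close(fd);
    return unlink(path);
}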
Is there a command like srm that does a single pass of zeros,
Yes. SRM does this with the correct parameters. From man srm:
srm -llz
-l lessens the security. Only two passes are written: one pass of 0xff bytes and a final pass of random values.
-l -l (given a second time) lessens the security even more: only one random pass is written.
-z wipes the last write with zeros instead of random data
srm -llzr will do the same recursively if wiping a directory.
You can even use srm -llz [file1] [file2] [file3] to wipe multiple files in this way with a single command.

Estimating the execution time of code on local linux system

I spend most of my time solving problems on Topcoder/SPOJ, so I naturally think about the performance (execution time) of my code on my system before submitting it.
On searching, I found the time command in Linux. But the problem is that it also includes the time spent typing in the values for the several test cases, in addition to the processing time. So I thought of making an input file and sending its contents to my code.
Something like
cat input.txt > ./myprogram
But this doesn't work. (I am not good at Linux piping.) Can anyone point out the mistake, or suggest a better approach to judge my code's execution time?
EDIT
All of my programs read from stdin
You need this:
./myprogram < input.txt
Or if you insist on the Useless Use of Cat:
cat input.txt | ./myprogram
You can put time in front of ./myprogram in either case.
You might want to look at xargs.
Something along the lines of
cat input.txt | xargs ./myprogram
You can also read the input from a dedicated file descriptor in your script. Assign file descriptor 3 to the input file:
exec 3< input.txt
Then use the read command in a while loop to read the file line by line:
while read -u 3 -r a
do
    # process "$a" here
done
exec 3<&-

How do I add an operator to Bash in Linux?

I'd like to add an operator (e.g. ^>) to prepend to a file instead of appending (>>). Do I need to modify the Bash source, or is there an easier way (plugin, etc.)?
First of all, you'd need to modify the bash sources, and quite heavily, because above all your ^> would be really hard to implement.
Note that bash redirection operators usually do very simple writes and work on a single file (or program, in the case of pipes) only. Excluding very specific solutions, you usually can't write to the beginning of a file, for the simple reason that you'd need to move all the remaining contents forward after each write. You could try doing that, but it would be hard, very inefficient (since every write would require re-writing the whole file) and very unsafe (since on any error you would end up with a random mix of the old and new versions).
That said, you are indeed probably better off with a function or any other solution which would use a temporary file, like others suggested.
For completeness, my own implementation of that:
prepend() {
    local tmp=$(tempfile)
    if cat - "${1}" > "${tmp}"; then
        mv "${tmp}" "${1}"
    else
        rm -f "${tmp}"
        # some error reporting
    fi
}
Note that, unlike jpa suggested, you should write the concatenated data to a temporary file, because that operation can fail, and if it does you don't want to lose your original file. Afterwards, you just replace the old file with the new one, or delete the temporary file and handle the failure any way you like.
Synopsis the same as with the other solution:
echo test | prepend file.txt
And a slightly modified version that retains permissions and plays safe with symlinks (if that is necessary), like >> does:
prepend() {
    local tmp=$(tempfile)
    if cat - "${1}" > "${tmp}"; then
        cat "${tmp}" > "${1}"
        rm -f "${tmp}"
    else
        rm -f "${tmp}"
        # some error reporting
    fi
}
Just note that this version is actually less safe, since if something else fills up the disk during the second cat, you'll end up with an incomplete file.
To be honest, I wouldn't personally use it but handle symlinks and resetting permissions externally, if necessary.
^ is a poor choice of character, as it is already used in history substitution.
To add a new redirection type to the shell grammar, start in parse.y. Declare it as a new %token so that it may be used, add it to STRING_INT_ALIST other_token_alist[] so that it may appear in output (such as error messages), update the redirection rule in the parser, and update the lexer to emit this token upon encountering the appropriate characters.
command.h contains enum r_instruction of redirection types, which will need to be extended. There's a giant switch statement in make_redirection in make_cmd.c processing redirection instructions, and the actual redirection is performed by functions throughout redir.c. Scattered throughout the rest of source code are various functions for printing, copying, and destroying pipelines, which may also need to be updated.
That's all! Bash isn't really that complex.
This doesn't discuss how to implement a prepending redirection, which will be difficult as the UNIX file API only provides for appending and overwriting. The only way to prepend to a file is to rewrite it entirely, which (as other answers mention) is significantly more complex than any existing shell redirections.
Might be quite difficult to add an operator, but perhaps a function could be enough?
function prepend { tmp=$(tempfile); cp "$1" "$tmp"; cat - "$tmp" > "$1"; rm "$tmp"; }
Example use:
echo foobar | prepend file.txt
prepends the text "foobar" to file.txt.
I think bash's plugin architecture (loading shared objects via the 'enable' built-in command) is limited to providing additional built-in commands. The redirection operators are part of the syntax for running simple commands, so I think you would need to modify the parser to recognize and handle your new ^> operator.
Most Linux filesystems do not support prepending. In fact, I don't know of any that has a stable userspace interface for it. So, as stated by others already, you can only rely on overwriting, either just the initial parts or the entire file, depending on your needs.
You can easily (partially) overwrite initial file contents in Bash, without truncating the file:
exec {fd}<>"$filename"
printf 'New initial contents' >$fd
exec {fd}>&-
Above, $fd is the file descriptor automatically allocated by Bash, and $filename is the name of the target file. Bash opens a new read-write file descriptor to the target file on the first line; this does not truncate the file. The second line overwrites the initial part of the file. The position in the file advances, so you can use multiple commands to overwrite consecutive parts in the file. The third line closes the descriptor; since there is only a limited number available to each process, you want to close them after you no longer need them, or a long-running script might run out.
Please note that > does less than you might expect. The shell will:
Remove the > and the following word from the command line, remembering the redirection.
When the command line has been processed and the command can be launched, call fork(2) (or clone(2)) to create a new process.
Modify the new process according to the command. That includes things like modified environment variables (SOMEVAR=foo yourcommand), but also changed file descriptors. At this point, a > yourfile on the command line has the effect that the file is open(2)'ed on the stdout file descriptor (that is, #1) in write-only mode, truncating the file to zero bytes. A >> yourfile has the effect that the file is opened on stdout in write-only, append mode.
Only now launch the program, e.g. with execv(yourprogram, yourargs).
The redirections could, for a simple example, be implemented like
open(yourfile, O_WRONLY|O_TRUNC);
or
open(yourfile, O_WRONLY|O_APPEND);
respectively.
The program that is then launched will have the correct environment set up and can happily write to fd 1. From here on, the shell is not involved; the real work is not done by the shell but by the operating system. As Unix doesn't have a prepend mode (and it would be impossible to integrate that feature correctly), everything you could try would end up as a very lousy hack.
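Put together, a toy version of what the shell does for yourprogram > yourfile could look like this (a sketch only; run_with_stdout_redirected is an illustrative name and all error handling is omitted):

#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Toy sketch of:  yourprogram > yourfile
 * A real shell does much more bookkeeping than this. */
int run_with_stdout_redirected(const char *yourfile, char *const yourargs[])
{
    pid_t pid = fork();
    if (pid == 0) {                                   /* child process */
        int fd = open(yourfile, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        /* use O_APPEND instead of O_TRUNC to get >> behaviour */
        dup2(fd, STDOUT_FILENO);                      /* make fd 1 point at the file */
        close(fd);
        execv(yourargs[0], yourargs);                 /* only returns on failure */
        _exit(127);
    }
    int status;
    waitpid(pid, &status, 0);                         /* parent waits for the child */
    return status;
}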
Try to re-think your requirements, there's always a simpler way around.

When is a file created when using output redirection?

I have a script that runs in AIX ksh that looks like this:
wc dir1/* dir2/* | {awk command to rearrange output} | {grep command to filter more} > dir2/output.txt
It is a precondition to this line that dir2/output.txt does not exist.
The issue is that dir2/output.txt has occasionally contained itself in the output (it has happened a handful of times out of hundreds of otherwise problem-free runs). dir1 and dir2 are NFS-mounted.
Is it related to the implementation of wc -- what if the first parameter takes a long time? I think not, as I've tried the following:
wc `sleep 5` *.txt > out.txt
Even in this case out.txt does not list itself.
As a last note, wildcards are used in this example where they are used in the actual script. So if the expansion happens first, why does this problem occur?
At what point is dir2/output.txt actually created?
Redirections are done by the shell, as are globs. Your problem is that, in the case of a pipeline, each pipeline stage is a separate subprocess; whether the shell subprocess that does the final redirection runs before the one that builds the glob of input files for wc will depend on details of the scheduler and system load, among other things, and should be considered indeterminate.
In short, you should assume that this will happen and either exclude dir2/output.txt (take a look at ksh extended glob patterns; in particular, something along the lines of dir2/!(output.txt) may be useful) or create the output somewhere else and mv it to its final location afterward.

C: copying multiple files into one

I am stuck on a problem I am trying to solve in C (Linux), using API calls only, to copy multiple input files given on the command line into one output file. I have searched the Internet for answers, but none seem to solve it.
My program allows me to specify multiple input files and one output file via the command line. For example:
./archiver file1.txt file2 file3 file4 outputfile
I read these parameters using argc/argv. For some reason, when I do ls -l, ./archiver and outputfile have the same number of bytes, which means none of my input files have been copied to my output file, just whatever was in memory (when I cat outputfile it shows a bunch of garbage characters).
None of the contents from my input files are in my output file.
Please could you help me? After that bunch of garbage characters I don't know what to do. I have tried reading up on malloc() etc., but I don't know how to implement that or whether it's even relevant here.
Any help is appreciated, thanks for your time.
file_desc_in = open(argv[i],O_RDONLY,0);
//NEED a loop to copy multiple files in...
while (!eof) {
bytes_read = read(file_desc_in, &buffer, sizeof(buffersize));
if (bytes_read > 1)
bytes_written = write(file_desc_out, &i, bytes_read);
else {
eof=1;
}
I haven't included the errors but I do have them. Thanks for replying immediately.
It'd help to see your code. There's not a lot here to go on, but I'm going to take a wild guess. I suspect you're copying the file specified by argv[0] (your program) and not getting the rest. I don't think I can do any better with what you've given.
You say you are only using API calls. What API are you talking about? The POSIX API? The standard C file I/O API?
If you are just combining input files, you don't really need to write a C program to do it. Since you are running Linux, try using the shell command cat input1 input2 input3 > output.
If you must write a C program to do it, start simple. Before you actually do any file I/O, make sure that you can interpret the input arguments correctly. Have your program simply read in the command-line input and print out something like this:
Input files: file1.txt file2.txt file3.txt
Output file: outputfile.txt
That way, you can verify that your CLI parsing code works correctly before you start worrying about file I/O. It's much easier to debug things one piece at a time.
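A tiny sketch of that first step (assuming, as in the question's example invocation, that the last argument is the output file):

#include <stdio.h>

/* Sketch: echo the parsed arguments before doing any file I/O. */
int main(int argc, char *argv[])
{
    printf("Input files:");
    for (int i = 1; i < argc - 1; i++)
        printf(" %s", argv[i]);
    printf("\nOutput file: %s\n", argc > 1 ? argv[argc - 1] : "(none)");
    return 0;
}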
Your outer loop needs to open each filename, and close it at the end of the loop. You close the output file at the very end, after all the input files are read.
You should also learn the difference between open, read, write and fopen, fread, fwrite.
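Putting both answers together, the overall structure could look something like this sketch (assuming the last command-line argument is the output file, as in the question's example; error handling is kept minimal):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: ./archiver in1 in2 ... out
 * Every argument but the last is an input file; the last is the output. */
int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s input... output\n", argv[0]);
        return 1;
    }

    int out = open(argv[argc - 1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0)
        return 1;

    char buffer[4096];
    for (int i = 1; i < argc - 1; i++) {          /* one input file per iteration */
        int in = open(argv[i], O_RDONLY);
        if (in < 0)
            continue;                             /* or report the error and abort */

        ssize_t n;
        while ((n = read(in, buffer, sizeof(buffer))) > 0)
            write(out, buffer, (size_t)n);        /* write exactly what was read;
                                                     a robust version would also
                                                     handle short writes */

        close(in);                                /* close each input inside the loop */
    }

    close(out);                                   /* close the output once, at the end */
    return 0;
}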
