Easiest way to overwrite a series of files with zeros - file

I'm on Linux. I have a list of files and I'd like to overwrite them with zeros and remove them. I tried using
srm file1 file2 file3 ...
but it's too slow (I have to overwrite and remove ~50 GB of data) and I don't need that kind of security (I know that srm does a lot of passes instead of a single pass with zeros).
I know I could overwrite every single file using the command
cat /dev/zero > file1
and then remove it with rm, but I can't do that manually for every single file.
Is there a command like srm that does a single pass of zeros, or maybe a script that can do cat /dev/zero on a list of files instead of on a single one? Thank you.

Something like this, using stat to get the correct size to write, and dd to overwrite the file, might be what you need:
for f in $(<list_of_files.txt)
do
read blocks blocksize < <(stat -c "%b %B" ${f})
dd if=/dev/zero bs=${blocksize} count=${blocks} of=${f} conv=notrunc
rm ${f}
done
Use /dev/urandom instead of /dev/zero for (slightly) better erasure semantics.
Edit: added conv=notrunc option to dd invocation to avoid truncating the file when it's opened for writing, which would cause the associated storage to be released before it's overwritten.

I use shred for doing this.
The following are the options that I generally use.
shred -n 3 -z <filename> - This will make 3 passes to overwrite the file with random data. It will then make a final pass overwriting the file with zeros. The file will remain on disk though, but it'll all the 0's on disk.
shred -n 3 -z -u <filename> - Similar to above, but also unlinks (i.e. deletes) the file. The default option for deleting is wipesync, which is the most secure but also the slowest. Check the man pages for more options.
Note: -n is used here to control the number of iterations for overwriting with random data. Increasing this number, will result in the shred operation taking longer to complete and better shredding. I think 3 is enough but maybe wrong.

The purpose of srm is to destroy the data in the file before releasing its blocks.
cat /dev/null > file is not at all equivalent to srm because
it does not destroy the data in the file: the blocks will be released with the original data intact.
Using /dev/zero instead of /dev/null does not even work because /dev/zero never ends.
Redirecting the output of a program to the file will never work for the same reason given for cat /dev/null.
You need a special-purpose program that opens the given file for writing, writes zeros over all bytes of the file, and then removes the file. That's what srm does.

Is there a command like srm that does a single pass of zeros,
Yes. SRM does this with the correct parameters. From man srm:
srm -llz
-l lessens the security. Only two passes are written: one mode with
0xff and a final mode random values.
-l -l for a second time lessons the security even more: only one
random pass is written.
-z wipes the last write with zeros instead of random data
srm -llzr will do the same recursively if wiping a directory.
You can even use 'srm -llz [file1] [file2] [file3] to wipe multiple files i this way with a single command

Related

Can a tar file be efficiently randomly edited?

Is there a way to modify an individual file within a tar file without having to rewrite the entire archive? I recognize this would probably result in fragmentation.
Is there any other archive format that does this?
First off, you should only ask exactly one question on StackOverflow. If you truly want to do frequent writes to the "archive", then you might be better off simply creating a large file, formatting it with some file system of your choice and then mounting it:
truncate -s $(( 512*1024*1024 )) 512MiB-filesystem.ext4
mkfs.ext4 512MiB-filesystem.ext4
sudo mount -o loop 512MiB-filesystem.ext4 mountpoint
sudo chmod a+w mountpoint/
echo foo > mountpoint/bar
sudo umount mountpoint
As for your question about TAR. It is possible and a fun exercise but it might lack the tools that actually implement this. First off, TAR is a very simple file format, it consists of 512 B blocks that can either contain metadata or actual file contents simply copied from the original file without any compression.
A TAR can actually contain multiple files for the same path and by convention, the last duplicate path wins. This means, in order to "modify" a file, you can simply append a newer version of that file to the TAR:
tar --append --file archive.tar modified-file
This should be fast, but it would grow the archive with every file change, so it should be used sparingly.
If you want even more in-place modifications, they should be possible but there is no tooling yet for that as far as I know. I would like to implement that into ratarmount but I'm not sure when I'll get to it.
File system operations and how to implement them:
Modifying a file:
File size is constant: As long as the file size does not change, we could simply change the file inside the TAR if we know the offset for the file contents in the TAR archive, which ratarmount does have stored in an SQLite database.
File size is quasi constant: Actually, the file size might even change by up to 511 B and it still would be possible to simply update the file inside the TAR as long as it doesn't change the number of required TAR blocks (512 B). This would also require updating the file size in the TAR metadata block and updating the checksum of that metadata block, though.
Required TAR blocks shrink: If the required TAR blocks become fewer than before, then it still would be rather easy to modify the TAR on the fly as outlined above. But we would have to somehow format the unused blocks. We could simply fill them with zeros, but in this case, we would have to call tar with the --ignore-zeros option to still get a valid tar. Without that, all files after that position would suddenly appear lost, so it might be unsuited in some circumstances. But we could also simply fill the empty blocks with dummy data, e.g., a directory metadata entry for the / (root) folder. As long as it contains the same metadata as the actual root folder, it basically is a no-op. It might even be possible to create dummy metadata blocks for invalid paths like . or .. to effectively create blocks that are ignored even without the --ignore-zeros option.
*Required TAR blocks grow:` This is the most difficult case. If there is simply no space to put the added data to the file, then we might have to delete it and move it to the end of the file (if it isn't already at the end). Removing the file without rewriting everything else in the TAR would be implemented as mentioned above by either filling the parts with zeros or dummy metadata blocks. At this point, however, we could implement defragmentation techniques, e.g., by keeping track of all empty / dummy blocks in the TAR and looking for fitting places. Or if we want to append 1 KiB to a 1 GiB file, then it might avoid fragmentation better if we move a small file right after the 1 GiB file to the end of the TAR to make space for the 1 KiB to append.
Modifying file metadata:
In General: In general, metadata can be changed by simply changing it in the metadata block and updating the block checksum. This does not require rewriting anything else in the archive
Removals: This is basically the same as file modifications for shrinking block counts. Simply overwrite the space for this file entry with zeros or dummy blocks and maybe keep track of it for writing files into this space at a later time.
Renames: Renames can actually be more tricky than one might think. In most cases, it can also simply be updated, however, there are two problematic cases:
The file name becomes too long: If the file name becomes too long, then the GNU long name extension will allocate further blocks right after the TAR metadata block, which will contain the very long filename. This however would require one more block, which might require moving around blocks inside the TAR as outlined for file modifications
There are file name collisions: If the target path already exists, then simply updating the metadata might not suffice depending on the order the files appear in the TAR. The last one with the same path wins. This might be easy to circumvent by simply forbidding to move to an existing path or by calling remove on the existing file beforehand.
Create: This is simple. Simply append the file to the end of the archive. If implemented manually, then we might have to find the actual end of the data because TAR archives have at least 2 (often more) zero-byte blocks after the last valid data and simply appending new files after those zero blocks would require the --ignore-zero-bytes option.

Potential Dangers of Running Code in Parallel

I am working in OSX and using bash for my shell. I have a script which calls an executable hundreds of times, and each call is independent of the other. Therefore I am going to run this code in parallel. However, each call to the executable appends output to a community text file on a new line.
The ordering of the text file is not of importance (although it would be nice, but totally not worth over complicating since I can just use unix sort command), but what is, is that every call of the executable properly printed to the file. My concern is that if I run the script in parallel that the by some freak accident, two threads will check out the text file, print to it and then save different copies back to the original directory of the text file. Thus nullifying one of the writes to the file.
Does this actually happen, or is my understanding of printing to a file flawed? I don't fully know if this would also be a case by case bases so I will provide some mock code of what is being done in my program below.
Script:
#!/bin/sh
abs=$1
input=$(echo "$abs" | awk '{print 0.004 + 0.005*$1 }')
./program input
"./program":
~~Normal .c file stuff here~~
~~VALUE magically calculated here~~
~~run number is pulled out of input and assigned to index for sorting~~
FILE *fpp;
fpp = fopen("Doc.txt","a");
fprintf(fpp,"%d, %.3f\n", index, VALUE);
fclose(fpp);
~Closing events of program.c~~
Commands to run script in parallel in bash:
printf "%s\n" {0..199} | xargs -P 8 -n 1 ./program
Thanks for any help you guys can offer.
A write() call (like fwrite()) with the append flag set in open() (like during fopen()) is guaranteed to avoid the race condition you describe.
O_APPEND
If set, the file offset shall be set to the end of the file prior to each write.
From: POSIX specifications for open:
opengroup.org open
Race conditions are what you are thinking of.
Not 100% sure but if you simple append to the end of the file rather than opening it and editing it should be right
If you have the option, make your program write to standard output instead of directly to a file. Then you can let the shell merge the output of your programs:
printf "%s\n" {0..199} | parallel -P 8 -n 1 ./program > merged_output.txt
Yeah, that looks like a recipe for disaster. If those processes both hit opening the file at the roughly the same time, only one will "take".
I suggest either (easier) writing to separate files then catting them together when the processing is done, or (harder) sending all results to a consumer process that will write the file for everyone.

How do I add an operator to Bash in Linux?

I'd like to add an operator ( e.g. ^> ) to handle prepend instead append (>>). Do I need to modify Bash source or is there an easier way (plugin, etc)?
First of all, you'd need to modify bash sources and quite heavily. Because, above all, your ^> would be really hard to implement.
Note that bash redirection operators usually do a very simple writes, and work on a single file (or program in case of pipes) only. Excluding very specific solutions, you usually can't write to a beginning of a file for the very simple reason you'd need to move all remaining contents forward after each write. You could try doing that but it will be hard, very ineffective (since every write will require re-writing the whole file) and very unsafe (since with any error you will end up with random mix of old and new version).
That said, you are indeed probably better off with a function or any other solution which would use a temporary file, like others suggested.
For completeness, my own implementation of that:
prepend() {
local tmp=$(tempfile)
if cat - "${1}" > "${tmp}"; then
mv "${tmp}" "${1}"
else
rm -f "${tmp}"
# some error reporting
fi
}
Note that you unlike #jpa suggested, you should be writing the concatenated data to a temporary file as that operation can fail and if it does, you don't want to lose your original file. Afterwards, you just replace the old file with new one, or delete the temporary file and handle the failure any way you like.
Synopsis the same as with the other solution:
echo test | prepend file.txt
And a bit modified version to retain permissions and play safe with symlinks (if that is necessary) like >> does:
prepend() {
local tmp=$(tempfile)
if cat - "${1}" > "${tmp}"; then
cat "${tmp}" > "${1}"
rm -f "${tmp}"
else
rm -f "${tmp}"
# some error reporting
fi
}
Just note that this version is actually less safe since if during second cat something else will write to disk and fill it up, you'll end up with incomplete file.
To be honest, I wouldn't personally use it but handle symlinks and resetting permissions externally, if necessary.
^ is a poor choice of character, as it is already used in history substitution.
To add a new redirection type to the shell grammar, start in parse.y. Declare it as a new %token so that it may be used, add it to STRING_INT_ALIST other_token_alist[] so that it may appear in output (such as error messages), update the redirection rule in the parser, and update the lexer to emit this token upon encountering the appropriate characters.
command.h contains enum r_instruction of redirection types, which will need to be extended. There's a giant switch statement in make_redirection in make_cmd.c processing redirection instructions, and the actual redirection is performed by functions throughout redir.c. Scattered throughout the rest of source code are various functions for printing, copying, and destroying pipelines, which may also need to be updated.
That's all! Bash isn't really that complex.
This doesn't discuss how to implement a prepending redirection, which will be difficult as the UNIX file API only provides for appending and overwriting. The only way to prepend to a file is to rewrite it entirely, which (as other answers mention) is significantly more complex than any existing shell redirections.
Might be quite difficult to add an operator, but perhaps a function could be enough?
function prepend { tmp=`tempfile`; cp $1 $tmp; cat - $tmp > $1; rm $tmp; }
Example use:
echo foobar | prepend file.txt
prepends the text "foobar" to file.txt.
I think bash's plugin architecture (loading shared objects via the 'enable' built-in command) is limited to providing additional built-in commands. The redirection operators are part of they syntax for running simple commands, so I think you would need to modify the parser to recognize and handle your new ^> operator.
Most Linux filesystems do not support prepending. In fact, I don't know of any one that has a stable userspace interface for it. So, as stated by others already, you can only rely on overwriting, either just the initial parts, or the entire file, depending on your needs.
You can easily (partially) overwrite initial file contents in Bash, without truncating the file:
exec {fd}<>"$filename"
printf 'New initial contents' >$fd
exec {fd}>&-
Above, $fd is the file descriptor automatically allocated by Bash, and $filename is the name of the target file. Bash opens a new read-write file descriptor to the target file on the first line; this does not truncate the file. The second line overwrites the initial part of the file. The position in the file advances, so you can use multiple commands to overwrite consecutive parts in the file. The third line closes the descriptor; since there is only a limited number available to each process, you want to close them after you no longer need them, or a long-running script might run out.
Please note that > does less than you expected:
Remove the > and the following word from the commandline, remembering the redirection.
When the commandline is processed and the command can be launched, calling fork(2) (or clone(2)), to create a new process.
Modify the new process according to the command. That includes things like modified environment variables (SOMEVAR=foo yourcommand), but also changed filedescriptors. At this point, a > yourfile from the cmdline will have the effect that the file is open(2)'ed at the stdout filedescriptor (that is #1) in write-only mode truncating the file to zero bytes. A >> yourfile would have the effect that the file is oppend at stdout in write-only mode and append mode.
(Only now launch the program, like execv(yourprogram, yourargs)
The redirections could, for a simple example, be implemented like
open(yourfile, O_WRONLY|O_TRUNC);
or
open(yourfile, O_WRONLY|O_APPEND);
respectively.
The program then launched will have the correct environment set up, and can happily write to fd1. From here, the shell is not involved. The real work is not done by the shell, but by the operating system. As Unix doesn't have a prepend mode (and it would be impossible to integrate that feature correctly), everything you could try would end up in a very lousy hack.
Try to re-think your requirements, there's always a simpler way around.

Using grep with large pattern file

I just wanted to use grep with option -f FILE. This should make grep use every line of FILE as a pattern and search for it.
Run:
grep -f patternfile searchfile
The pattern-file I used is 400MB large. The file I want to search through is 7GB.
After 3 min the process ended up with 70GB RAM and no reaction.
Is this normal? Am I doing something wrong? Is grep not capable is such large scale?
Thank you for ideas.
If the lines in the pattern file are literal strings, using the "-F" option will make it much faster.
You could try breaking the task up such that the grep process ends on each pass of the file. I'm not sure how useful this will be, however, given the sheer size of the file you're searching.
for pattern in `cat patternFile`
do
grep "$pattern" searchFile
done
I have to say that this is the first time I've ever heard of anyone using a 700MB pattern file before - I'm not surprised it ate up so much memory.
If you have time, I would suggest either breaking the file up into sections and processing each section one at a time, or even just processing the 7GB file one regex at a time. If you can fit the whole 7GB file in memory then, and aren't worried about how long it takes, then that might be the most reliable solution.

removing a line from a text file?

I am working with a text file, which contains a list of processes under my programs control, along with relevant data.
At some point, one of the processes will finish, and thus will need to be removed from the file (as its no longer under control).
Here is a sample of the file contents (which has enteries added "randomly"):
PID=25729 IDLE=0.200000 BUSY=0.300000 USER=-10.000000
PID=26416 IDLE=0.100000 BUSY=0.800000 USER=-20.000000
PID=26522 IDLE=0.400000 BUSY=0.700000 USER=-30.000000
So for example, if I wanted to remove the line that says PID=26416.... how could I do that, without writing the file over again?
I can use external unix commands, however I am not very familiar with them so please if that is your suggestion, give an example.
Thanks!
Either you keep the contents of the file in temporary memory and then rewrite the file. Or you could have a file for each of the PIDs with the relevant information in them. Then you simply delete the file when it's no longer running. Or you could use a database for this instead.
As others have already pointed out, your only real choice is to rewrite the file.
The obvious way to do that with "external UNIX commands" would be grep -v "PID=26416" (or whatever PID you want to remove, obviously).
Edit: It is probably worth mentioning that if the lines are all the same length (as you've shown here) and order doesn't matter, you could delete a line more efficiently by copying the last line into the space being vacated, then shorten the file so eliminate what had been the last line. This will only work if they really are all the same length though (e.g., if you got a PID of '1', you'd need to pad it to the same length as the others in the file).
The only way is by copying each character that comes after the deleted line down over the characters that are deleted.
It is far more efficient to simply rewrite the file.
how could I do that, without writing the file over again?
You cannot. Filesystems (perhaps besides more esoteric record based ones) does not support insertion or deletion.
So you'll have to write the lines to a temporary file up till the line you want to delete, skip over that line, and write the rest of the lines to the file. When done, rename/copy the temp file to the original filename
Why are you maintaining these in a text file? That's not the best model for such a task. But, if you're stuck with it ... if these lines are guaranteed to all be the same length (it appears that way from the sample), and if the order of the lines in the file doesn't matter, then you can write the last line over the line for the process that has died and then shorten the file by one line with the (f)truncate() call if you're on a POSIX system: see Jonathan Leffler's answer in How to truncate a file in C?
But note carefully netrom's answer, which gives three different better ways to maintain this info.
Also, if you stick with a text file (preferably written from scratch each time from data structures you maintain, as per netrom's first suggestion), and you want to be sure that the file is always well formed, then write the new data into a temp file on the same device (putting it in the same directory is easiest) and then do a rename() call, which is an atomic operation.
You can use sed:
sed -i.bak -e '/PID=26416/d' test
-i is for editing in place. It also creates a back-up file with the new extension .bak
-e is for specifying the pattern. The /d indicates all lines matching the pattern should be deleted.
test is the filename
The unix command for it is:
grep -v "PID=26416" myfile > myfile.tmp
mv myfile.tmp myfile
The grep -v part outputs the file without the rows with the search term.
The > myfile.tmp part creates a new temp file for this output.
The mv part renames the temp file to the original file.
Note that we are rewriting the file here, and moreover, we can lose data if someone write something to file between the two commands.

Resources