I mainly want to break a big file into smaller files.
I use a stream because I do not want to keep the big file on my disk.
What I am looking for is something similar to:
sed -n 'a,bp' # this works on line numbers in the file, while I want bytes
or:
cat filename | head -c b | tail -c "$((b - a))" # this way takes too long with big files
If performance is an issue and you are using large files, I think you would do better with a bigger block size in dd, like this:
dd "bs=$a" skip=1 if=filename | dd "bs=$((b - a))" count=1
If you want to extract from byte offset a to byte offset b, you can use the dd command:
dd bs=1 "skip=$a" "count=$(($b - $a))" if=filename
The quotes are optional. The main problem to worry about is whether the shell arithmetic will handle offsets bigger than 31-bits (2 GiB). Most likely it won't be an issue (for example, 64-bit Bash handles 12-digit numbers with ease on Mac OS X), but be cautious if you need to deal with really large files on 32-bit systems. You could use bc instead of the built-in $((…arithmetic…)) notation, if need be.
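For example, a minimal usage sketch (the file name and offsets here are made up):
a=1000
b=2000
# copy bytes 1000..1999 of bigfile (offset a inclusive, offset b exclusive) into chunk.bin
dd bs=1 "skip=$a" "count=$((b - a))" if=bigfile of=chunk.bin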
I have 3 executable files, MyExe1, MyExe2 and MyExe3, that I can run in the terminal (on my macOS machine) as:
$ ./MyExe1 9
9 is odd
$ ./MyExe2 9
9 is odd
$ ./MyExe3 9
9 is odd
$ ./MyExe1 8
8 is even
The inputs are in my file MyInputs.txt (which contains the numbers 0, 1, ..., 100).
How could I get the output printed ONLY for the numbers where the outputs (for the same input) differ?
MyExe1, MyExe2 and MyExe3 accept only a single input (a number), not a file, nor a vector of numbers.
I found that there are several ways: popen(), execl(), system(). Would it be possible to invoke MyExe1, MyExe2 and MyExe3 with popen()?
In a shell script on a Mac, you could write (test-exe.sh):
out1="MyExe1.output"
out2="MyExe2.output"
out3="MyExe3.output"
for file in "$out1" "$out2" "$out3"; do cp /dev/null "$file"; done
while read -r number
do
    ./MyExe1 "$number" >> "$out1"
    ./MyExe2 "$number" >> "$out2"
    ./MyExe3 "$number" >> "$out3"
done
diff3 "$out1" "$out2" "$out3"
Then you can read the numbers from MyInputs.txt by running:
$ sh test-exe.sh < MyInputs.txt
…maybe some output…
$
If the outputs are all the same, there will be no output. If there's a difference somewhere, the diff3 program will report it.
I have the impression that diff3 "$out1" "$out2" "$out3" requires the complete run of each MyExe# (with all the input numbers from 1 to 100) before anything is compared. Is it not possible to avoid that? (If I need to deal with more numbers, e.g. from 1 to 1,000,000, I am afraid I will run into memory trouble.)
There are various comments to make. First, diff3 works on the three files it is given. It doesn't much care how big they are. You could run it on the output from each individual run of the executables.
Second, the outputs shown are quite small (say less than 30 bytes per run), so one million outputs for each program generates less than 30 MiB of data, or under 100 MiB in total. There's nothing to stop you from running the script for 10,000 lines at a time and running diff3 on each set of intermediate files.
For my money, I'd revise the programs so that they can read from standard input and do their calculations repeatedly. If there are command-line arguments, process them as now (except allow more than one argument per invocation). If there are no arguments, read lines from standard input and process the value on each line. Again, this means the code potentially processes multiple inputs per invocation.
You can then divvy the input file up into chunks of a convenient size and run the three executables and compare the three results. To borrow Perl's motto: TMTOWTDI — There's more than one way to do it.
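For example, if you do modify the programs to read numbers from standard input as suggested above, a sketch along these lines would print only the inputs where the three outputs disagree (MyInputs.txt and the executable names are from the question; everything else here is an assumption):
paste MyInputs.txt <(./MyExe1 < MyInputs.txt) \
                   <(./MyExe2 < MyInputs.txt) \
                   <(./MyExe3 < MyInputs.txt) |
while IFS=$'\t' read -r n out1 out2 out3
do
    # print the number only when at least one program's output differs
    if [ "$out1" != "$out2" ] || [ "$out1" != "$out3" ]; then
        printf '%s: %s | %s | %s\n' "$n" "$out1" "$out2" "$out3"
    fi
done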
I'm working on a simple arc injection exploit, wherein this particular string gives me the desired address of the place where I'd like to jump: Á^F@^@. This is the address 0x004006c1 (I'm using a 64-bit Intel processor, so x86-64 with little-endian byte order).
When I provide this string Á^F@^@ as input to a vulnerable gets() routine in my function and inspect the addresses using gdb, the address gets modified to 0x00400681 instead of 0x004006c1. I'm not quite sure why this is happening. Furthermore, is there any way to easily provide hexadecimal values to a gets() routine on stdin? I've tried something like 121351...12312\xc1\x06\x40\x00, but instead of picking up \xc1 as a single byte, it treats each character literally, so I get something like 5c 78 ... (hex for \ and x, followed by hex for c and 1).
Any help is appreciated, thanks!
You could just put the raw bytes into a file somewhere and pipe it directly into your application.
$ path/to/my_app <raw_binary_data
Alternatively, you could wrap the application in a shell script that converts escaped hex bytes into their corresponding byte values. The echo utility will do this when the -e switch is set on the command line, for example:
$ echo '\x48\x65\x6c\x6c\x6f'
\x48\x65\x6c\x6c\x6f
$ echo -e '\x48\x65\x6c\x6c\x6f'
Hello
You can use this feature to process your application's input as follows:
while read -r line; do echo -e "$line"; done | path/to/my_app
To terminate the input, try pressing Control-D or Control-C.
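If you need to get specific byte values such as \xc1\x06\x40\x00 into a file in the first place, bash's built-in printf understands \x escapes, so a sketch like this would work (payload.bin is a made-up name; prepend whatever padding bytes your exploit needs):
# write the raw little-endian bytes of 0x004006c1, then feed them to the program
printf '\xc1\x06\x40\x00' > payload.bin
path/to/my_app < payload.bin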
I'm on Linux. I have a list of files and I'd like to overwrite them with zeros and remove them. I tried using
srm file1 file2 file3 ...
but it's too slow (I have to overwrite and remove ~50 GB of data) and I don't need that kind of security (I know that srm does a lot of passes instead of a single pass with zeros).
I know I could overwrite every single file using the command
cat /dev/zero > file1
and then remove it with rm, but I can't do that manually for every single file.
Is there a command like srm that does a single pass of zeros, or maybe a script that can do cat /dev/zero on a list of files instead of on a single one? Thank you.
Something like this, using stat to get the correct size to write, and dd to overwrite the file, might be what you need:
for f in $(<list_of_files.txt)
do
read blocks blocksize < <(stat -c "%b %B" ${f})
dd if=/dev/zero bs=${blocksize} count=${blocks} of=${f} conv=notrunc
rm ${f}
done
Use /dev/urandom instead of /dev/zero for (slightly) better erasure semantics.
Edit: added conv=notrunc option to dd invocation to avoid truncating the file when it's opened for writing, which would cause the associated storage to be released before it's overwritten.
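If some of the file names contain spaces, a variant of the same idea that reads the list line by line (assuming one file name per line in list_of_files.txt) might be safer:
while IFS= read -r f
do
    # query the allocated block count and block size, then overwrite in place
    read -r blocks blocksize < <(stat -c "%b %B" "$f")
    dd if=/dev/zero bs="$blocksize" count="$blocks" of="$f" conv=notrunc
    rm "$f"
done < list_of_files.txt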
I use shred for doing this.
The following are the options that I generally use.
shred -n 3 -z <filename> - This will make 3 passes overwriting the file with random data, then a final pass overwriting it with zeros. The file will remain on disk, but it will contain nothing but zeros.
shred -n 3 -z -u <filename> - Similar to above, but also unlinks (i.e. deletes) the file. The default option for deleting is wipesync, which is the most secure but also the slowest. Check the man pages for more options.
Note: -n is used here to control the number of iterations of overwriting with random data. Increasing this number makes the shred operation take longer but gives better shredding. I think 3 is enough, but I may be wrong.
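If all you want is the single pass of zeros followed by deletion that the question asks for, something like this should handle a whole list of files (a sketch, assuming GNU shred and GNU xargs, and one file name per line in list_of_files.txt):
# -n 0: no random passes; -z: one final pass of zeros; -u: remove the file afterwards
xargs -d '\n' shred -n 0 -z -u < list_of_files.txt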
The purpose of srm is to destroy the data in the file before releasing its blocks.
cat /dev/null > file is not at all equivalent to srm because
it does not destroy the data in the file: the blocks will be released with the original data intact.
Using /dev/zero instead of /dev/null does not even work because /dev/zero never ends.
Redirecting the output of a program to the file will never work for the same reason given for cat /dev/null.
You need a special-purpose program that opens the given file for writing, writes zeros over all bytes of the file, and then removes the file. That's what srm does.
Is there a command like srm that does a single pass of zeros,
Yes. srm does this with the right options. From man srm:
srm -llz
-l lessens the security. Only two passes are written: one pass with
0xff and a final pass with random values.
-l -l for a second time lessens the security even more: only one
random pass is written.
-z wipes the last write with zeros instead of random data
srm -llzr will do the same recursively if wiping a directory.
You can even use srm -llz file1 file2 file3 to wipe multiple files in this way with a single command.
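For the original list-of-files use case, the same command could be driven by xargs (a sketch, assuming the file names contain no spaces or newlines):
xargs srm -llz < list_of_files.txt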
I'm looking to replace characters at specific byte offsets.
Here's what is provided:
An input file that is simple ASCII text.
An array within a Bash shell script, each element of the array is a numerical byte-offset value.
The goal:
Take the input file, and at each of the byte-offsets, replace the character there with an asterisk.
So essentially the idea I have in mind is to somehow go through the file, byte-by-byte, and if the current byte-offset being read is a match for an element value from the array of offsets, then replace that byte with an asterisk.
This post seems to indicate that the dd command would be a good candidate for this action, but I can't understand how to perform the replacement multiple times on the input file.
Input file looks like this:
00000
00000
00000
The array of offsets looks like this:
offsetsArray=("2" "8" "9" "15")
The output file's desired format looks like this:
0*000
0**00
00*00
Any help you could provide is most appreciated. Thank you!
Please check my comment above about the newline offsets. Assuming that is correct (note I have changed your offset array), I think this should work for you:
#!/bin/bash
read -r -d ''
offsetsArray=("2" "8" "9" "15")
txt="${REPLY}"
for i in "${offsetsArray[@]}"; do
txt="${txt:0:$i-1}*${txt:$i}"
done
printf "%s" "$txt"
Explanation:
read -d '' reads the whole input (redirected file) in one go into the $REPLY variable. If you have large files, this can run you out of memory.
We then loop through the offsets array, one element at a time. We use each index i to grab i-1 characters from the beginning of the string, then insert a * character, then add the remaining bytes from offset i. This is done with bash parameter expansion. Note that while your offsets are one-based, bash strings use zero-based indexing.
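As a tiny illustration of that parameter expansion (s and i are made-up names), with a one-based offset i=3:
s="abcdef"
i=3
echo "${s:0:i-1}*${s:i}"    # prints ab*def: the third character is replaced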
In use:
$ ./replacechars.sh < input.txt
0*000
0**00
00*00
$
Caveat:
This is not really a very efficient solution, as it causes the string containing the whole file to be copied for every offset. If you have large files and/or a large number of offsets, it will run slowly. If you need something faster, another language that allows modification of individual characters in a string would be much better.
The usage of dd can be a bit confusing at first, but it's not that hard:
outfile="test.txt"
# create some test data
echo -n 0123456789abcde > "$outfile"
offsetsArray=("2" "7" "8" "13")
for offset in "${offsetsArray[@]}"; do
dd bs=1 count=1 seek="$offset" conv=notrunc of="$outfile" <<< '*'
done
cat "$outfile"
Important for this example is to use conv=notrunc; otherwise dd truncates the file to the length of the blocks it seeks over. bs=1 specifies that you want to work with blocks of size 1, and seek specifies the offset at which to start writing count blocks.
The above produces 01*3456**9abc*e
With the same offset considerations as @DigitalTrauma's superior solution, here's a GNU awk-based alternative. This assumes your file contains no null bytes:
(IFS=','; awk -F '' -v RS=$'\0' -v OFS='' -v offsets="${offsetsArray[*]}" \
'BEGIN{split(offsets, o, ",")};{for (k in o) $o[k]="*"; print}' file)
0*000
0**00
00*00
I have a project for school which involves making a C program that works like tar on a Unix system. I have some questions that I would like someone to explain to me:
1) The dimension of the archive. I understood (from browsing the internet) that an archive has a defined number of blocks, 512 bytes each. So the header takes 512 bytes, then it's followed by the content of the file (if there is only one file to archive) organized in blocks of 512 bytes, then 2 more blocks of 512 bytes.
For example: let's say that I have a txt file of 0 bytes to archive. That should mean 512*3 bytes in total. Why, when I create the archive with the tar utility in Unix and check its properties, does it have 10,240 bytes? I think it adds some 0 (NULL) bytes, but I don't know where, why, or how many...
2) The header checksum. As far as I know, this should be the size of the archive. When I check it with hexdump -C it appears to be a number near the real size (shown in properties) of the archive, for example 11200 or 11205 or something similar if I archive a 0-byte txt file. Is this size in octal or decimal? My bet is that it is octal, because all information you put in the header needs to be in octal. My second question at this point is: what is added on top of the original size of 10,240 bytes?
3) Header mode. Let's say that I have a file with mode 664; the file-type digit will be 0, so I should put 0664 in the header. Why, in an actual archive, are 3 more zeros printed at the start (0000664)?
There have been various versions of the tar format, and not all of the extensions to previous formats were always compatible with each other. So there's always a bit of guessing involved. For example, in very old Unix systems, file names were not allowed to have more than 14 bytes, so the space for the file name (including the path) was plenty; later, with longer file names, it had to be extended but there wasn't space, so the file name got split into 2 parts; even later, GNU tar introduced the @LongLink pseudo-symbolic links that would make older tars at least restore the file to its original name.
1) Tar was originally a Tape ARchiver. To achieve constant throughput to tapes and avoid starting/stopping the tape too much, several blocks needed to be written at once. 20 blocks of 512 bytes were the default, and the -b option is there to set the number of blocks. Very often, this size was pre-defined by the hardware, and using the wrong blocking factor made the resulting tape unusable. This is why tar appends \0-filled blocks until the archive size is a multiple of the record size (the blocking factor times 512 bytes).
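You can see this blocking behaviour from the shell (a quick sketch, assuming GNU tar):
: > empty.txt                      # create a 0-byte file
tar -c -f empty.tar empty.txt
ls -l empty.tar                    # 10240 bytes: padded to the default 20-block record
tar -c -b 1 -f small.tar empty.txt
ls -l small.tar                    # 1536 bytes: header + two zero-filled end-of-archive blocks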
2) The file size is in octal, and contains the true size of the original file that was put into the tar. It has nothing to do with the size of the tar file.
The checksum is calculated from the sum of the header bytes, but then stored in the header as well. So the act of storing the checksum would change the header, thus invalidating the checksum. That's why you store all the other header fields first, set the checksum field to spaces, then calculate the checksum, and then replace the spaces with the calculated value.
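A rough way to check this from the shell (a sketch, assuming a GNU userland; archive.tar is a placeholder name): sum all 512 header bytes with the 8-byte checksum field at offset 148 treated as spaces, and compare against the stored field.
# recompute: bytes 0-147, eight spaces, bytes 156-511
sum=$( { head -c 148 archive.tar; printf '        '; tail -c +157 archive.tar | head -c 356; } \
       | od -An -tu1 | tr -s ' ' '\n' | awk '{ s += $1 } END { printf "%o\n", s }' )
# stored value: octal digits starting at offset 148 (leading zeros may differ)
stored=$(dd if=archive.tar bs=1 skip=148 count=6 2>/dev/null)
echo "computed: $sum   stored: $stored"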
Note that the header of a tarred file is pure ASCII. This way, in those old days, when a tar file (whose components were plain ASCII) got corrupted, an admin could just open the tar file with an editor and restore the components manually. That's why the designers of the tar format were afraid of \0 bytes and used spaces instead.
3) Tar files can store block devices, character devices, directories and suchlike. Unix stores these file types in the same place as the permission flags, and the header file mode contains the whole file mode, including the file-type bits. That's why the number is longer than the pure permissions.
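To see point 3 in an archive, you can dump the 8-byte mode field, which starts at offset 100 of the header (a sketch; archive.tar is a placeholder):
dd if=archive.tar bs=1 skip=100 count=8 2>/dev/null | od -c   # e.g. 0 0 0 0 6 6 4 \0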
There's a lot of information at http://en.wikipedia.org/wiki/Tar_%28computing%29 as well.