How to extract specific bytes from a file using Unix

How do I extract 12-byte chunks from a binary file at certain positions within the file?
If I wanted to extract the first 12 bytes I could do something like
head -c12 file.bin > output
If I wanted to extract 12 bytes starting at byte 61 I could do something like
head -c72 file.bin | tail -c12 > output
Is there a simpler way if I have something like twenty 12-byte chunks I need to extract?
Thanks.

Use dd:
dd bs=1 skip=60 count=12 if=file.bin of=output
(Note skip, not seek: skip skips over bytes of the input, while seek skips over bytes of the output file.)
You can write a shell loop to substitute the numbers.
You could also consider using awk, Perl or Python, if there's a lot of them to do or it needs to be really fast.
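For instance, a shell loop along these lines (the offsets and the chunk_* output names are just an example) runs one dd per chunk:
# zero-based byte offsets of the 12-byte chunks to extract -- example values only
for offset in 0 60 120 1000; do
    dd bs=1 skip="$offset" count=12 if=file.bin of="chunk_$offset" 2>/dev/null
done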

Using xxd:
xxd -p -s 0x3c -l 12 file.bin > output
0x3c is 60 in hexadecimal, i.e. the zero-based offset of byte 61 (the same offset as skip=60 above).
Using hexdump:
hexdump -ve '16/1 "%0.2x " "\n"' -s 0x3c -n 12 file.bin > output
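Note that xxd -p and hexdump write a textual hex dump rather than the raw bytes that head and dd produce. If you want the raw 12 bytes in output, one option (a sketch reusing the offset above) is to round-trip through xxd -r -p:
xxd -p -s 0x3c -l 12 file.bin | xxd -r -p > output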

Related

Concatenate text files, separating them with a new line

I have just over 100 text files in a directory, functioning as a simple database with each line containing one record. In total, these files add up to around 25GB. However, the records are not sorted alphabetically and there are many duplicates, so in order to alphabetise the contents of all ~100 text files using something like sort -u, I'm first trying to combine all of these files into a single large text file. A simple cat would be unsuitable because the beginnings and ends of the 100 text files do not contain new lines, which (to my understanding) would cause the last record in a file to become merged with the first record of the next file.
What solutions are there that would allow me to concatenate my text files while ensuring there is a single new line character separating them?
A simple
sort -u *.db > uniquified # adjust glob as needed
should do it; sort will interpose newlines between files should it be necessary.
cat *.db | sort -u
is a classic UUoC (Useless Use of Cat), and the glitch with files lacking trailing newlines is not the only issue with it.
Having said that, 25GB probably won't fit in your RAM, so sort will end up creating temporary files anyway. It might turn out to be faster to sort the files in four or five groups, and then merge the results. That could take better advantage of the large number of duplicates. But I'd only experiment if the simple command really takes an exorbitant amount of time.
Even so, sorting the files individually is probably slower still; usually the best bet is to max out your memory resources for each invocation of sort. You could, for example, use xargs with the -n option to split the file list into groups of a couple of dozen files each. Once each group is sorted, you can use sort -m to merge the sorted temporaries (see the sketch after the notes below).
A couple of notes on how to improve sorting speed:
Use LC_COLLATE=C sort if you don't need locale-aware sorting of alphabetic data. That typically speeds sort up by a factor of three or four.
Avoid using RAM disks for temporary space. (On many Linux distros, /tmp is a RAM disk.) Since sort falls back to temporary files when it runs out of RAM, putting those temporaries on a RAM disk is counterproductive; sort's -T option lets you point them at a different directory. For the same reason, don't put your own temporary output files in /tmp. /var/tmp should be real disk; even better, if possible, use a second disk drive (not a slow USB drive, of course).
Avoid slugging your machine down with excessive swapping while you're doing the sort, by turning swap off:
sudo swapoff -a
You can turn it back on afterwards, although I personally run my machine like this all the time because it avoids diving into complete unresponsiveness under memory pressure.
The ideal is to adjust -S so that sort uses as much memory as you can spare, and avoid the use of internal temporaries by sorting in chunks which fit into that amount of memory. (Merging the sorted chunks is a lot faster than sorting, and it reads and writes sequentially without needing extra disk space.) You'll probably need to do some experimentation to find a good chunk size.
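A rough sketch of that chunk-and-merge approach, assuming GNU sort and xargs (the group size of 25, the -S value and the sorted.* names are placeholders to adjust for your machine):
# sort the ~100 .db files in groups of 25, writing one sorted temporary per group
printf '%s\0' *.db | xargs -0 -n 25 sh -c 'LC_COLLATE=C sort -u -S 4G "$@" > "sorted.$$"' sh
# merge the sorted temporaries and de-duplicate in one sequential pass
LC_COLLATE=C sort -m -u sorted.* > uniquified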
I would suggest creating that file by concatenating all the input files and appending a newline after each one:
out=newfile.txt
rm -f "$out"
for f in *.txt
do
    cat "$f" >> "$out"
    echo >> "$out"    # newline separator after each file
done
Now you can sort it. You may also want to remove the blank lines afterwards, in case some input files already ended with a newline.
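For example, something like this (the uniquified name is just an example) sorts, de-duplicates, and drops the blank separator lines in one go:
sort -u newfile.txt | grep -v '^$' > uniquified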
You can use awk.
$ od -t x1 file1
0000000 72 65 63 6f 72 64 31 0a 72 65 63 6f 72 64 32
0000017
$ od -t x1 file2
0000000 72 65 63 6f 72 64 31 0a 72 65 63 6f 72 64 32 0a
0000020 72 65 63 6f 72 64 33
0000027
$ awk 1 file1 file2
record1
record2
record1
record2
record3
The 1 here is the awk script: a condition that is always true with no action, so the default action (print the record) runs for every line, and awk always terminates each output record with a newline, even when the input line lacked one.
sort * should be all you need, but just in case you ever do need to append newlines to file contents for processing by a subsequent tool, here's how to do that:
$ ls
file1 file2
$ cat file1
foo$
$ cat file2
bar$
$ cat file1 file2
foobar$
$ find . -type f -exec sh -c 'cat "$1"; printf "\n"' sh {} \;
foo
bar
That is, of course, assuming that your cat can handle files that don't end in newlines!

Write non-ASCII characters to stdin of a program on a tty (over ssh)

I'm SSHd into a remote server with some binary challenges.
At one point it asks me to input text. I know it's using fgets() to read stdin, and that I can overflow the buffer it is being copied into and overwrite a nearby variable.
The problem is that I don't know how to type the address values I need, \x84, \x04, etc. If I were able to use bash I would use echo -ne "\x84" or a C hex array, but I can't do that kind of thing here.
I've tried using a hex-to-ASCII converter and copying the binary character, and also using an expect script to send the binary values, but both have the same issue: \x84 adds an extra char and \x04 won't write at all.
Any ideas on the best way to reliably write values to memory that can't be represented by ASCII characters, over ssh on a Unix tty?
You can write characters in the range 0x00 to 0x1f by typing control characters on the keyboard: hold down the Control key while typing a character in the third column of an ASCII chart to get the corresponding code from the first column.
Some control characters have special meaning to the terminal driver (e.g. Ctl-c to kill the process); you can type them literally by preceding them with Ctl-v.
And you can get 0x7f by typing Ctl-v Delete (this is the backward delete character, which might be labelled Backspace, not the forward delete that may be on a separate keypad).
I'm not sure if there's any easy way to type characters above this. Depending on the settings in your terminal emulator, you might be able to get the high bit set by holding the Alt key while typing the corresponding ASCII character. E.g. Alt-A would send 0xc1 (0x80 | 0x41).
For high characters, you can probably copy/paste.
e.g. echo -ne "\x84" | xclip -i then middle-click in your terminal emulator, if your desktop is also running Linux. (Or maybe not, see below.) Or echo ... | ssh user@host could work.
ssh -T (or the equivalent option in another ssh client) might also be an option, to "Disable pseudo-terminal allocation", so the program on the remote side would have its stdin be a pipe from sshd, rather than a pseudo-terminal, I think. This would disable stuff like ^s and ^v being special, I think.
Conversely, echo foo | ssh -tt will force it to request a tty on the remote side, even though you're piping stuff into ssh.
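For example, something along these lines (user@host and ./challenge are made-up names) hands the bytes to the remote program's stdin with no tty in the way:
# -T: no pseudo-terminal on the remote side, so nothing reinterprets the bytes
printf '\x84\x04\n' | ssh -T user@host ./challenge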
A less-intrusive way to make sure binary data over SSH gets through the TTY layer and to the stdin of the receiving program is to precede every byte with a control-v (hex 0x16).
As Barmar points out, that's the literal-next character (lnext in stty -a output). You can use it before every byte of your payload; the TTY layer on the receiving side strips it even when the byte that follows is an ordinary character.
# LANG=C sed to replace every byte with ^V byte
# using bash $'' to put a literal control-V on sed's command line
echo "hello" | LANG=C sed $'s/./\x16&/g' | hd
16 68 16 65 16 6c 16 6c 16 6f 0a |.h.e.l.l.o.|
You can test all this locally by typing into hexdump -C (aka hd). Just run it in a terminal, and type or paste some stuff, then control-D until it exits.
$ echo -ne "\x01\xff\x99" | LANG=C sed $'s/./\x16&/g' | hd
00000000 16 01 16 ff 16 99 |......|
00000006
# yup, correct bytes from sed
$ echo -ne "\x01\xff\x99" | LANG=C sed $'s/./\x16&/g' | xclip -i
$ LANG=C hd
^A�� (middle click, return, control-d)
00000000 01 ef bf bd ef bf bd 0a |........|
# nope, that munged my data :/
$ xclip -o | hd
00000000 16 01 16 ff 16 99 |......|
So the X selection itself is correct, but it's getting munged either by Konsole as I paste, or by the TTY layer on the way from Konsole to hexdump? The latter seems unlikely; probably it's a paste problem. Konsole's "encoding" setting is UTF-8. It doesn't seem to have a plain ASCII setting.
Maybe try with LANG=C xterm or something, or just script expect correctly to send actual binary data to ssh, rather than escape codes.
fgets of course does not process escape sequences, any more than strcpy would. In general, C library functions don't; it's the compiler that processes escape sequences inside string literals, at compile time.

printing part of file

Is there a magic Unix command for printing part of a file? I have a file that has several million lines and I would like to skip the first million or so lines and print the next million lines of the file.
Thank you in advance.
To extract data, sed is your friend.
Assuming a one-off task that you can type at your command line:
sed -n '200000,300000p' file | enscript
"number comma (,) number" is one form of a range cmd in sed. This one starts at line 2,000,000 and *p*rints until you get to 3,000,000.
If you want the output to go to your screen remove the | enscript
enscript is a utility that manages the process of sending data to Postscript compatible printers. My Linux distro doesn't have that, so its not necessarily a std utility. Hopefully you know what command you need to redirect to to get output printed to paper.
If you want to "print" to another file, use
sed -n '200000,300000p' file > smallerFile
IHTH
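On a file with millions of lines it is also worth telling sed to quit once it has printed the range, so it doesn't keep reading to the end of the file. For the question's "skip the first million lines, print the next million" case, that would be something like:
sed -n '1000001,2000000p;2000000q' file > smallerFile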
I would suggest awk as it is a little easier and more flexible than sed:
awk 'FNR>12 && FNR<23' file
where FNR is the record (i.e. line) number within the current file. So the above prints lines 13 through 22, i.e. record numbers above 12 and below 23.
And you can make it more specific like this:
awk 'FNR<100 || FNR >990' file
which prints lines whose record number is less than 100 or greater than 990. Or, to print every line after line 100 as well as any line containing "fred":
awk 'FNR >100 || /fred/' file
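Applied to the original question (skip roughly the first million lines and print the next million), with an early exit so awk stops reading once the range is done:
awk 'FNR>2000000{exit} FNR>1000000' file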

Unix command-line tool for randomly shuffling chunks of bytes in a binary file?

Is there an easy way of randomly shuffling fixed-size byte chunks?
I have a large binary file (say, hundreds of gigabytes) containing many fixed-size elements. I do not care much about the quality of the randomness, but I want to shuffle the two-byte (or any other fixed size, up to 8 bytes) elements of the binary file. Is there a way of combining Unix core tools to achieve this? If there is no such tool, I might have to write some C. I'd like to hear what people recommend.
Here's a stupid shell trick to do so.
First, break the file down into 2-byte chunks using xxd.
Shuffle them with shuf.
Reassemble the file using xxd.
e.g.
xxd -p -c 2 input_file | shuf - | xxd -p -r - output_file
I haven't tested it on huge files. You may want to use an intermediary file.
Alternately, you could use sort -R like so:
xxd -c 2 in_file | sort -R | cut -d' ' -f 2 | xxd -r -p - out_file
This depends on xxd outputting offsets, which make every line unique, so identical chunks are still shuffled independently rather than being grouped together by sort -R.
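The same trick extends to other chunk sizes via xxd's -c option; for instance, for the 8-byte case mentioned in the question (input_file and output_file as in the answer above):
xxd -p -c 8 input_file | shuf | xxd -r -p > output_file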
Given the size of the input files you're working with, this is a reasonably complex problem. I wouldn't try to push the limits of shell scripting; it's best to code this in C or another language. I'm not aware of a tool that makes this easy.
Try:
split -b "$CHUNK_SIZE" "$FILE" && find . -name "x*" | perl -MList::Util=shuffle -e 'print shuffle <>' | xargs cat > temp.bin
This creates a large number of files, each $CHUNK_SIZE bytes long (the last may be smaller if the total size isn't a multiple of $CHUNK_SIZE), named xaa, xab, xac, and so on; it then lists the files, shuffles the list, and concatenates them in the shuffled order.
This takes up an extra 2x the file's size in disk space (the chunks plus the reassembled copy) and probably won't work well with very large files or tiny chunk sizes, simply because of the number of files involved.

Changing a file

My program was supposed to write two values to a file side by side (as two columns). By some stupid programming mistake I managed to do the following:
fprintf(network_energy_delta_E_BM, "%f\n\t %f\n", delta_network_energy_BM,
network_energy_initial);
"%f\n\t %f\n"
Means that my data ended up looking a bit like this:
1234
56
24
99
and so on and so forth.
This is causing some problems for what I need to do. Is there any way to amend the file so that the values are side by side again? I tried
paste network_energy_delta_E_BM.dat foo.dat | awk '{ print $1 }' > new.dat
where network_energy_delta_E_BM.dat is my file and foo.dat is just an empty file, but it takes all the entries and just puts them into one column. Can anyone help me fix this, please?
There are over 2000000 entries, so I can't fix this by hand.
Or is there any way of taking the new file new.dat, pulling out every second entry, and writing those entries to a second column?
Thank you
What about using this awk expression?
awk '!(NR%2){print p,$0*1; next}{p=$0}' file
It joins every two lines into one, where NR is the number of the record (line, in this case). Note that $0*1 forces the second value to be treated as a number, which strips the leading tab and space that the "\n\t " in the format put at the start of every second line (be aware that it also reformats the number via awk's OFMT, so long decimals may be rounded).
Test
$ cat a
1234
56
24
99
$ awk '!(NR%2){print p,$0*1; next}{p=$0}' a
1234 56
24 99
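Another common idiom for joining every pair of lines, if paste is available, is to let it read the same stream twice (the file names are the ones from the question):
paste - - < network_energy_delta_E_BM.dat > new.dat
Unlike the awk version this keeps the leading tab/space in front of the second value, but it also leaves the numbers exactly as they were printed.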
