Concatenate text files, separating them with a new line

I have just over 100 text files in a directory, functioning as a simple database with each line containing one record. In total, these files add up to around 25GB. However, the records are not sorted alphabetically and there are many duplicates, so in order to alphabetise the contents of all ~100 text files using something like sort -u, I'm first trying to combine all of these files into a single large text file. A simple cat would be unsuitable because the beginnings and ends of the 100 text files do not contain new lines, which (to my understanding) would cause the last record in a file to become merged with the first record of the next file.
What solutions are there that would allow me to concatenate my text files while ensuring there is a single new line character separating them?

A simple
sort -u *.db > uniquified # adjust glob as needed
should do it; sort will interpose newlines between files should it be necessary.
cat *.db | sort -u
is a classic UUoC and the glitch with files lacking trailing newlines is not the only issue.
Having said that, 25GB probably won't fit in your RAM, so sort will end up creating temporary files anyway. It might turn out to be faster to sort the files in four or five groups, and then merge the results. That could take better advantage of the large number of duplicates. But I'd only experiment if the simple command really takes an exorbitant amount of time.
Even so, sorting the files individually is probably even slower; usually the best bet is to max out your memory resources for each invocation of sort. You could, for example, use xargs with the -n option to split the filelist into groups of a couple of dozen files each. Once you have each group sorted, you could use sort -m to merge the sorted temporaries.
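For example, a rough sketch of that group-then-merge approach might look like this (the *.db glob, the group size of 25, and the group_* file names are illustrative, and it assumes GNU xargs and sort):
# sort each group of ~25 files into its own temporary, then merge the sorted pieces
printf '%s\0' *.db | xargs -0 -n 25 sh -c 'sort -u "$@" > "group_$$.sorted"' sh
sort -m -u group_*.sorted > uniquified
rm -f group_*.sorted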
A couple of notes on how to improve sorting speed:
Use LC_COLLATE=C sort if you don't need locale-aware sorting of alphabetic data. That typically speeds sort up by a factor of three or four.
Avoid using RAM disks for temporary space. (On many Linux distros, /tmp is a RAM disk.) Since sort writes temporary files to disk when it runs out of RAM, putting those temporaries on a RAM disk is counterproductive. For the same reason, don't put your own temporary output files in /tmp. /var/tmp should be real disk; even better, if possible, use a second disk drive (not a slow USB drive, of course).
Avoid bogging your machine down with excessive swapping while you're doing the sort, by turning swap off:
sudo swapoff -a
You can turn it back on afterwards, although I personally run my machine like this all the time because it avoids diving into complete unresponsiveness under memory pressure.
The ideal is to adjust -S so that sort uses as much memory as you can spare, and to avoid spilling to temporary files by sorting in chunks which fit into that amount of memory. (Merging the sorted chunks is a lot faster than sorting, and it reads and writes sequentially without needing extra disk space.) You'll probably need to do some experimentation to find a good chunk size.
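Putting these notes together, a hedged example of a single invocation might be (the 8G memory limit and the /var/tmp temporary directory are illustrative; adjust them to your hardware):
LC_COLLATE=C sort -u -S 8G -T /var/tmp *.db > uniquified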

You could create that file by concatenating all the input files and appending a newline after each one:
out=newfile.txt
rm -f "$out"                # start from an empty output file
for f in *.txt
do
    cat "$f" >> "$out"      # append the file's contents
    echo >> "$out"          # then a newline, so the next file starts on its own line
done
Now you can sort it. Or remove empty lines first, in case some input file already ends with a newline (each such file would contribute a blank line).
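For example, assuming the newfile.txt built above, something like this would sort, deduplicate, and drop any blank lines in one pass (just a sketch; untested at this scale):
grep . newfile.txt | sort -u > sorted.txt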

You can use awk.
$ od -t x1 file1
0000000 72 65 63 6f 72 64 31 0a 72 65 63 6f 72 64 32
0000017
$ od -t x1 file2
0000000 72 65 63 6f 72 64 31 0a 72 65 63 6f 72 64 32 0a
0000020 72 65 63 6f 72 64 33
0000027
$ awk 1 file1 file2
record1
record2
record1
record2
record3
Here 1 is the awk script: it is a condition that is always true with no action attached, so the default action (print the current record) runs for every line, and awk terminates each printed record with a newline.
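Applied to the original problem, something along these lines should then work (the *.db glob is an assumption about how the files are named):
awk 1 *.db | sort -u > uniquified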

sort * should be all you need, but just in case you ever do need to append newlines to file contents for processing by a subsequent tool, here's how to do that:
$ ls
file1 file2
$ cat file1
foo$
$ cat file2
bar$
$ cat file1 file2
foobar$
$ find . -type f -exec sh -c 'cat "$1"; printf "\n"' sh {} \;
foo
bar
That is, of course, assuming that your cat can handle files that don't end in newlines!

Related

Why is awk seemingly failing to completely store a very large array?

I'm using the following code to identify the 99th percentile of a list of numbers:
sort -n file.txt | awk '{all[NR] = $0} END{print all[int(NR*0.99 - 0.01)]}'
where file.txt looks like:
50
150
75
313
etc.
This works fine for small (20,000) line files but for bigger ones (140-790 million lines) it is giving a smaller number than when you just sort the text file into a new file, and go down to the appropriate line. As far as I can tell awk isn't giving an error that it's run out of memory, but just prints out a number at around the 40th percentile for a 140M line file. I can implement this step a different way, but I was just wondering if anyone knew why this might be happening? From looking at other posts it doesn't seem like awk has a maximum array size so not sure what's happening.
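For comparison, a memory-light version of the manual check described above (sort the file, then jump to the appropriate line) might look roughly like this; the index arithmetic just mirrors the awk one-liner, and this is only a sketch:
n=$(wc -l < file.txt)                                  # total number of lines
k=$(awk -v n="$n" 'BEGIN{print int(n*0.99 - 0.01)}')   # same index the awk script computes
sort -n file.txt | sed -n "${k}p"                      # print just that line; no big array needed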

Write non-ASCII characters to stdin of a program on a tty (over ssh)

I'm SSHd into a remote server with some binary challenges.
At one point it asks for me to input text. I know it's using fgets() to read stdin, and that I can overflow where it is being copied to and overwrite a nearby variable.
The problem is I don't know how to type the address values I need, \x84 and \x04 etc. If I were able to use bash I would echo -ne "\x84" or use a C hex array but I can't do that kind of thing here.
I've tried using a hex-to-ASCII converter and copying the binary character, and also using an expect script to send the binary values, but both have the same issue: \x84 adds an extra char and \x04 won't write at all.
Any idea the best way to reliably write values to memory that can't be represented by ASCII characters over ssh on a Unix tty?
You can write characters in the range 0x00 to 0x1f by typing control characters on the keyboard. Hold down the Control key while typing a character in the third column of this ASCII Chart to get the corresponding code in the first column.
Some control characters have special meaning to the terminal driver (e.g. Ctl-c to kill the process); you can type them literally by preceding them with Ctl-v.
And you can get 0x7f by typing Ctl-v Delete (this is the backward delete character, which might be labeled Backspace, not the forward delete that may be in a separate keypad).
I'm not sure if there's any easy way to type characters above this. Depending on the settings in your terminal emulator, you might be able to get the high bit set by holding the Alt key while typing the corresponding ASCII character. E.g. Alt-A would send 0xc1 (0x80 | 0x41).
For high characters, you can probably copy/paste.
e.g. echo -ne "\x84" | xclip -i then middle-click in your terminal emulator, if your desktop is also running Linux. (or maybe not, see below). Or echo ... | ssh user@host could work.
ssh -T (or the equivalent option in whatever SSH client you use) might also be an option, to "Disable pseudo-terminal allocation", so the program on the remote side would have its stdin be a pipe from sshd rather than a pseudo-terminal, I think. This would disable stuff like ^s and ^v being special.
Conversely, echo foo | ssh -tt will force it to request a tty on the remote side, even though you're piping stuff into ssh.
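For instance, a hedged sketch of the piping approach (user@host and ./challenge are placeholders, and it relies on bash's printf understanding \x escapes):
printf 'AAAA\x84\x04' | ssh -T user@host ./challenge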
A less-intrusive way to make sure binary data over SSH gets through the TTY layer and to the stdin of the receiving program is to precede every byte with a control-v (hex 0x16).
As Barmar points out, that's the literal-next character (lnext in stty -a output). You can use it before every byte of your payload; it's stripped by the TTY layer in the receiver even before ordinary characters.
# LANG=C sed to replace every byte with ^V byte
# using bash $'' to put a literal control-V on sed's command line
echo "hello" | LANG=C sed $'s/./\x16&/g' | hd
16 68 16 65 16 6c 16 6c 16 6f 0a |.h.e.l.l.o.|
You can test all this locally by typing into hexdump -C (aka hd). Just run it in a terminal, and type or paste some stuff, then control-D until it exits.
$ echo -ne "\x01\xff\x99" | LANG=C sed $'s/./\x16&/g' | hd
00000000 16 01 16 ff 16 99 |......|
00000006
# yup, correct bytes from sed
$ echo -ne "\x01\xff\x99" | LANG=C sed $'s/./\x16&/g' | xclip -i
$ LANG=C hd
^A�� (middle click, return, control-d)
00000000 01 ef bf bd ef bf bd 0a |........|
# nope, that munged my data :/
$ xclip -o | hd
00000000 16 01 16 ff 16 99 |......|
So the X selection itself is correct, but it's getting munged either by Konsole as I paste, or by the TTY layer on the way from Konsole to hexdump? The latter seems unlikely; probably it's a paste problem. Konsole's "encoding" setting is UTF-8. It doesn't seem to have a plain ASCII setting.
Maybe try with LANG=C xterm or something, or just script expect correctly to send actual binary data to ssh, rather than escape codes.
fgets of course does not process escape sequences, any more than strcpy would. In general C functions don't; but in C the compiler processes escape sequences inside string literals at compile time.

Linux redirect to multiple targets

How could I redirect output to multiple targets, say stdout, a file, a socket and so on?
Say I have a system connected to some network. When it fails, the person supervising it via ssh should be able to notice, or the GUI client should receive the error info, or, in the worst case, we can still find something in the log.
Or even more targets. Atomicity may or may not need to be guaranteed.
So how to do this in bash and/or in C?
I think you are looking for the "tee" command.
You can redirect with tee to any number of files and to any commands too, like:
seq 50 | tee copy1 copy2 >((echo Original linecount: $(grep -c ''))>&2) | grep '9'
which prints:
9
19
29
39
49
Original linecount: 50 #printed to stderr
or
seq 50 | tee copy1 copy2 >((echo Original linecount: $(grep -c ''))>&2) | grep '9' | wc -l
which prints the count of numbers containing the digit 9 among the first 50 numbers, while making two copies of the original sequence:
Original linecount: 50 #stderr
5
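In its simplest form, something like this covers the scenario in the question: write the message to the terminal, append it to a log file, and send it to a TCP listener all at once (the log path, host and port are placeholders, and netcat behaviour varies between implementations):
echo "system failed" | tee -a /var/log/mysystem.log >(nc logger.example.com 5140)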

Unix command-line tool for randomly shuffling chunks of bytes in a binary file?

Is there an easy way of randomly shuffling fixed-size chunks of bytes?
I have a large binary file (say, hundreds of gigabytes) containing many fixed-size records. I don't care much about the quality of the randomness, but I want to shuffle the two-byte (or any fixed size up to 8 bytes) elements in the binary file. Is there a way of combining Unix core tools to achieve this? If there is no such tool, I might have to write some C code. I'd like to hear what recommendations people have.
Here's a stupid shell trick to do so.
First, break the file down into 2-byte chunks using xxd
Shuffle it with shuf
Reassemble the file using xxd.
eg.
xxd -p -c 2 input_file | shuf - | xxd -p -r - output_file
I haven't tested it on huge files. You may want to use an intermediary file.
Alternately, you could use sort -R like so:
xxd -c 2 in_file | sort -R | cut -d' ' -f 2 | xxd -r -p - out_file
This depends on xxd outputting offsets, which make every line unique, so identical chunks don't get grouped together by sort -R.
Given the size of the input files to work with, this is a sufficiently complex problem that I wouldn't try to push the limits of shell scripting; it's best to code this in C or another language.
I'm not aware of a tool that can make this easy.
Try:
split -b $CHUNK_SIZE $FILE && find . -name "x*" | perl -MList::Util='shuffle' -e "print shuffle<>" | xargs cat > temp.bin
This creates a large number of files, each of size $CHUNK_SIZE (or less, if the total file size doesn't divide evenly by $CHUNK_SIZE), named xaa, xab, xac, etc.; it then lists the files, shuffles the list, and concatenates them in that order.
This will take up an extra 2x of disk space and probably won't work with large files.

How to extract specific bytes from a file using Unix

How do I extract 12-byte chunks from a binary file at certain positions within the file?
If I wanted to extract the first 12 bytes I could do something like
head -c12 file.bin > output
If I wanted to extract 12 bytes starting at byte 61 I could do something like
head -c72 file.bin | tail -c12 > output
Is there a simpler way if I have something like twenty 12-byte chunks to extract?
Thanks
Use dd:
dd bs=1 skip=60 count=12 if=file.bin of=output
You can write a shell loop to substitute the numbers.
You could also consider using awk, Perl or Python, if there's a lot of them to do or it needs to be really fast.
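For example, a rough sketch of such a loop, with made-up offsets and numbered output files:
# byte offsets of the 12-byte chunks to extract (illustrative values only)
for off in 0 60 120 180
do
    dd bs=1 skip="$off" count=12 if=file.bin of="chunk_$off.bin" 2>/dev/null
done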
Using xxd:
xxd -p -s 0x3d -l 12 file.bin > output
0x3d is 61 in hexadecimal (note that with -p the output is the bytes written out as hex text; pipe it through xxd -r -p if you need the raw bytes)
Using hexdump:
hexdump -ve '16/1 "%0.2x " "\n"' -s 0x3d -n 12 file.bin > output
