Pick 20 records each time and transpose from a big file - loops

I have a big file with 1 column and 800,000 rows
Example:
123
234
...
5677
222
444
I want to transpose it into 20 numbers per line.
Example:
123,234,....
5677,
222,
444,....
I tried using a while loop, like this:
while [ $(wc -l < list.dat) -ge 1 ]
do
cat list.dat | head -20 | awk -vORS=, '{ print $1 }'| sed 's/,$/\n/' >> sample1.dat
sed -i -e '1,20d' list.dat
done
but this is insanely slow.
Can anyone suggest a faster solution?

pr is the right tool for this, for example:
$ seq 100 | pr -20ats,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40
41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60
61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80
81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
For your file, try pr -20ats, list.dat
Depending on the width of the column text, you might run into the error pr: page width too narrow. In that case, try:
$ seq 100000 100100 | pr -40ats,
pr: page width too narrow
$ seq 100000 100100 | pr -J -W79 -40ats,
100000,100001,100002,100003,100004,100005,100006,100007,100008,100009,100010,100011,100012,100013,100014,100015,100016,100017,100018,100019,100020,100021,100022,100023,100024,100025,100026,100027,100028,100029,100030,100031,100032,100033,100034,100035,100036,100037,100038,100039
100040,100041,100042,100043,100044,100045,100046,100047,100048,100049,100050,100051,100052,100053,100054,100055,100056,100057,100058,100059,100060,100061,100062,100063,100064,100065,100066,100067,100068,100069,100070,100071,100072,100073,100074,100075,100076,100077,100078,100079
100080,100081,100082,100083,100084,100085,100086,100087,100088,100089,100090,100091,100092,100093,100094,100095,100096,100097,100098,100099,100100
The formula for the -W value is (col-1)*len(delimiter) + col, where col is the number of columns required.
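In the example above there are 40 columns and a one-character delimiter, so (40-1)*1 + 40 = 79, hence -W79.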
From man pr
pr - convert text files for printing
-a, --across
print columns across rather than down, used together with -COLUMN
-t, --omit-header
omit page headers and trailers; implied if PAGE_LENGTH <= 10
-s[CHAR], --separator[=CHAR]
separate columns by a single character, default for CHAR is the <TAB> character without -w and 'no char' with -w. -s[CHAR] turns off line truncation of all 3 column options (-COLUMN|-a -COLUMN|-m) except -w is set
-COLUMN, --columns=COLUMN
output COLUMN columns and print columns down, unless -a is used. Balance number of lines in the columns
on each page
-J, --join-lines
merge full lines, turns off -W line truncation, no column alignment, --sep-string[=STRING] sets separators
-W, --page-width=PAGE_WIDTH
set page width to PAGE_WIDTH (72) characters always, truncate lines, except -J option is set, no interference with -S or -s
See also Why is using a shell loop to process text considered bad practice?
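(For comparison, not part of the answer above: the same single-pass job can be sketched in awk, which avoids the shell loop the linked post warns about.)
awk '{printf "%s%s", $0, (NR%20 ? "," : "\n")} END{if (NR%20) print ""}' list.dat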

If you don't wish to use any other external binaries, you can refer to the SO link below, which answers a similar question in depth.
bash: combine five lines of input to each line of output
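As a rough illustration of that idea (a minimal pure-bash sketch, not copied from the linked answer):
# group stdin into comma-separated lines of 20 values, bash only
i=0 line=''
while IFS= read -r n; do
    line+=${line:+,}$n
    (( ++i % 20 == 0 )) && { printf '%s\n' "$line"; line=''; }
done < list.dat
[[ -n $line ]] && printf '%s\n' "$line"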

If you want to use sed:
sed -n '21~20 { x; s/^\n//; s/\n/, /g; p;}; 21~20! H;' list.dat
The first command,
21~20 { x; s/^\n//; s/\n/, /g; p;}
is triggered at lines matching 21+(n*20), n>=0. There, everything that was collected in the hold space at the other lines via the second command,
21~20! H;
is processed:
x;
puts the content of the hold buffer (20 lines) in the pattern space and places the current line (21+(n*20)) in the hold buffer. In the pattern space:
s/^\n//
removes the leading newline and:
s/\n/, /g
does the desired substitution. Finally:
p;
prints the now 20-columned row.
After that, subsequent lines are again collected in the hold buffer and the process repeats.
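Note that a trailing group (the lines after the last 21+(n*20) boundary) stays in the hold space and is never printed. A variant that flushes the hold space at end of input could look like this (a sketch; the 21~20 address already assumes GNU sed):
sed -n '21~20 { x; s/^\n//; s/\n/, /g; p;}; 21~20! H; ${ x; s/^\n//; /./ { s/\n/, /g; p;}; }' list.dat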

Related

Separate a range into single lines, adding the values in between

I have a text list with values in a range as follows:
TSNN-00500--00503 TSNN-00510--00515
But I need to separate them into single lines in a text file and add the values in between.
Can this be done with a script easily?
I want the result in a new text file, as follows:
TSNN-00500
TSNN-00501
TSNN-00502
TSNN-00503
TSNN-00510
TSNN-00511
TSNN-00512
TSNN-00513
TSNN-00514
TSNN-00515
"Easy" is subjective.
Getting a sequence from a range can be performed with seq:
$ seq -f%05.0f 00500 00503
00500
00501
00502
00503
Getting the range from the string you provide can be performed with bash find-and-replace:
$ text='TSNN-00500--00503'
$ echo ${text//[^0-9]/ }
00500 00503
Prepending a string to a line can be performed with sed:
$ echo 00502 | sed 's/^/TSNN-/'
TSNN-00502
Wrapping everything in a loop gives:
$ for i in TSNN-00500--00503 TSNN-00510--00515 ; do seq -f%05.0f ${i//[^0-9]/ } | sed 's/^/TSNN-/' ; done
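which prints the requested lines (redirect with > newfile.txt to put them in a file):
TSNN-00500
TSNN-00501
TSNN-00502
TSNN-00503
TSNN-00510
TSNN-00511
TSNN-00512
TSNN-00513
TSNN-00514
TSNN-00515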

Return the number of lines required for paragraphs of text for a given width in Bash

Given a width I am trying to calculate how many lines a block of text, that contains paragraphs (\n line endings), would take.
I cannot simply divide the number of characters by the width because the line endings create new lines early. I cannot count the line endings only because some paragraphs will wrap.
I think that I need to loop through the paragraphs, dividing the characters by the width for each and adding the results together.
count_lines() {
TEXT="$(echo -e $1)"
WIDTH=$2
LINES=0
for i in "${TEXT[@]}"
do
PAR=$(echo -e "$i" | wc -c)
LINES=$LINES + (( $PAR / $WIDTH ))
done
RETURN $LINES
}
Reading the text as an array did not work.
count_lines() {
fmt -w "$2" <<<"$1" | wc -l
}
fmt is a longstanding (present as far back as Plan 9, and part of coreutils on GNU systems) UNIX tool which wraps text to a desired width.
<<< is herestring syntax, a ksh and bash alternative to heredocs which allows their use without splitting a script into multiple lines.
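For instance (note that the herestring supplies the string plus a trailing newline on standard input):
$ wc -c <<<hello
6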
Testing this:
text=$(cat <<'EOF'
This is a sample document with multiple paragraphs. This paragraph is the first one.

This is the second paragraph of the sample document.
EOF
)
count_lines "$text" 20
...returns output of 10. That's correct, because the text wraps as follows (line numbers at the left added for readability):
1 This is a
2 sample document
3 with multiple
4 paragraphs. This
5 paragraph is the
6 first one.
7
8 This is the second
9 paragraph of the
10 sample document.

Sed: Better way to address the n-th line where n are elements of an array

We know that sed loops over each line of a file and, for each line, runs the given list of commands. When the file is extremely large, the time and resource cost of repeating that operation may be terrible.
Suppose that I have an array of line numbers which I want to use as addresses to delete or print with sed (e.g. A=(20000 30000 50000 90000)), and there is a VERY LARGE target file.
The easiest way may be:
(Remark by @John1024: be careful about the line numbers changing on each loop)
( for NL in ${A[@]}; do sed "$NL d" $very_large_file; done; ) > .temp_file;
cp .temp_file $very_large_file; rm .temp_file
The problem with the code above is that it loops over the whole file once for each line number in the array.
To avoid this, one can:
#COMM=`echo "${A[@]}" | sed 's/\s/d;/g;s/$/d/'`;
#sed -i "$COMM" $very_large_file;
#Edited: better with direct parameter expansion:
sed -i "${A[*]/%/d;}" $very_large_file;
It first prints the array and replaces each SPACE and the END_OF_LINE with sed's d command, so that the string looks like "20000d;30000d;50000d;90000d"; the second line then treats this string as the command list of sed. The result is that this code loops over the file only once.
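For example, the parameter expansion alone produces (sed does not mind the spaces between the commands):
$ A=(20000 30000 50000 90000)
$ echo "${A[*]/%/d;}"
20000d; 30000d; 50000d; 90000d;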
Moreover, for the in-place operation (the -i argument), one cannot quit early with q even when the greatest line number of interest has passed, because if one did, the lines after that line (e.g. 90001+) would disappear (the in-place operation seems to just overwrite the file with stdout).
Better ideas?
(Reply to @user unknown:) I think it could be even more efficient if we managed to quit the loop once all indexed lines have passed. We can't, using sed -i, for the aforementioned reasons. Printing each line to a file costs more time than copying a file (compare cat file1 > file2 with cp file1 file2). We may benefit from this observation, using other methods or tools. This is what I expect.
PS: The points of this question are "line location" and "efficiency"; the "delete lines" operation is just an example. Real tasks involve much more: appending/inserting/substituting, field separation, case judgments followed by reads from/writes to files, calculations, etc.
In other words, it may invoke all kinds of operations, creating sub-shells or not, caring about variable passing, and so on. The tools used should allow line processing, and the problem is how to get to the lines of interest to perform all those operations.
Any comments are appreciated.
First make a copy to a testfile for checking the results.
You want to sort the line numbers, highest first.
echo "${a[#]}" | sed 's/\s/\n/g' | sort -rn
You can feed commands into ed using printf:
printf "%s\n" "command1" "command2" w q testfile | ed -s testfile
Combining these:
printf "%s\n" $(echo "${a[@]}" | sed 's/\s/\n/g' | sort -rn | sed 's/$/d/') w q |
ed -s testfile
Edit (thanks @Ed_Morton): this can be written in fewer steps with
printf "%s\n" $(printf '%sd\n' "${a[@]}" | sort -rn) w q | ed -s testfile
I cannot remove the sort, because each delete instruction counts line numbers in the file as it stands at that moment: after deleting line 3, the old line 7 has become line 6, so deletions have to be applied highest first.
I tried to find a command for editing the file without redirecting to another, but I started with the remark that you should make a copy. I have no choice but to upvote the straightforward awk solution that doesn't need a sort.
sed is for doing s/old/new, that is all, and when you add a shell loop to the mix you've really gone off the rails (see https://unix.stackexchange.com/q/169716/133219). To delete lines whose numbers are stored in an array (using seq to generate input, since no sample input/output was provided in the question):
$ a=( 3 7 8 )
$ seq 10 |
awk -v a="${a[*]}" 'BEGIN{split(a,tmp); for (i in tmp) nrs[tmp[i]]} !(NR in nrs)'
1
2
4
5
6
9
10
and if you wanted to stop processing with awk once the last target line has been deleted and let tail finish the job, then you could figure out the max value in the array up front and run awk on just the part up to that last target line:
max=$( printf '%s\n' "${a[@]}" | sort -rn | head -1 )
head -"$max" file | awk '...' > out
tail -n +"$((max+1))" file >> out
idk if that'd really be any faster than just letting awk process the whole file since awk is very efficient, especially when you're not referencing any fields and so it doesn't do any field splitting, but you could give it a try.
You could generate an intermediate sed command file from your lines.
printf '%s\n' "${A[@]}" | sort -n > lines_to_delete
min=$(head -1 lines_to_delete)
max=$(tail -1 lines_to_delete)
# turn each line number into a sed delete command, shifted so that it
# addresses the middle section of the file that sed actually sees
awk -v off="$((min-1))" '{print $1-off "d"}' lines_to_delete > commands.sed
head -n "$((min-1))" input > output                            # everything before the first deleted line
sed -n "${min},${max}p" input | sed -f commands.sed >> output  # the middle section, with deletions applied
tail -n +"$((max+1))" input >> output                          # everything after the last deleted line
mv output input

Bash print array element characters vertically side by side

I have an array that I want to print vertically, but also side by side.
Ex.
I have an array with the following elements, separated by spaces, with the characters in each element separated by commas:
0,1,2 3,4,5 6,7,8
I want it to output:
036
147
258
Any help is greatly appreciated!
ary=(0,1,2 3,4,5 6,7,8)
pr -T -"${#ary[#]}" < <(IFS=,; echo "${ary[*]}" | tr , '\n') | tr -d '[:blank:]'
prints
036
147
258
Notes:
the < <(...) syntax is a redirection (first <) of a process substitution
the bit inside the process substitution prints the array elements joined with a comma then translates the commas into newlines
the output of the process substitution (a single column of digits) is redirected into pr.
pr is a handy tool for forcing a column of output into columns.
the -"${#ary[#]}" option tells pr to use the same number of columns as there are array elements.
the output of pr is sent to a second tr, that deletes any horizontal whitespace.
If you want commas in the output, change the second tr to tr -s '[:blank:]' ',', or use pr's -s option and drop the final tr:
pr -T -s, -"${#ary[@]}" < <(IFS=,; echo "${ary[*]}" | tr , '\n')
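Either way, this prints:
0,3,6
1,4,7
2,5,8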

Shell script cut the beginning and the end of a file

So I have a file and I'd like to cut the first 33 lines and the last 6 lines of it. What I am trying to do is send the whole file through a cat command (cat file) and then use the "head" and "tail" commands to remove those parts, but I don't know how to do so.
Eg (this is just the idea)
cat file - head -n 33 file - tail -n 6 file
How am I supposed to do this? Is it possible to do it with "sed" (how)? Thanks in advance.
This is probably what you want:
$ tail -n +34 file | head -n -6
See the tail
-n, --lines=K
output the last K lines, instead of the last 10; or use -n +K to output lines starting with the Kth
and head
-n, --lines=[-]K
print the first K lines instead of the first 10; with the leading '-', print all but the last K lines of each file
man pages.
Example:
$ cat file
one
two
three
four
five
six
seven
eight
$ tail -n +4 file | head -n -2
four
five
six
Notice that you don't need the cat (see UUOC).
First count the total number of lines, then print the middle part (reads the file twice):
l=$(wc -l < file)
awk -v l="$l" 'NR>33 && NR<=l-6' file
or load the file into an array, then print the lines you need (reads the file once):
awk '{a[NR]=$0} END{for(i=34;i<=NR-6;i++) print a[i]}' file
or awk with head, so you don't have to think so much (also reads the file once):
awk 'NR>33' file | head -n -6
sed -n '1,33b; 34{N;N;N;N;N};N;P;D' file
This will work too: it skips the first 33 lines, then keeps a 7-line window in the pattern space, always printing only the oldest line, so the last 6 lines never get printed.
This might work for you (GNU sed):
sed '1,33d;:a;$d;N;s/\n/&/6;Ta;P;D' file
It deletes the first 33 lines, then the loop at :a fills the pattern space until it holds 7 lines, printing the first and shifting; $d discards the buffer at the last line, so the final 6 lines are dropped.
