Pick 20 records each time and transpose from a big file - loops
I have a big file with 1 column and 800,000 rows
Example:
123
234
...
5677
222
444
I want to transpose it into 20 numbers per line.
Example:
123,234,...
...
5677,222,444,...
I tried using a while loop like this:
while [ $(wc -l < list.dat) -ge 1 ]
do
cat list.dat | head -20 | awk -vORS=, '{ print $1 }'| sed 's/,$/\n/' >> sample1.dat
sed -i -e '1,20d' list.dat
done
but this is insanely slow.
Can anyone suggest a faster solution?
pr is the right tool for this, for example:
$ seq 100 | pr -20ats,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40
41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60
61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80
81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
For your file, try pr -20ats, list.dat
Depending on the width of the column text, you might run into the error pr: page width too narrow. In that case, try:
$ seq 100000 100100 | pr -40ats,
pr: page width too narrow
$ seq 100000 100100 | pr -J -W79 -40ats,
100000,100001,100002,100003,100004,100005,100006,100007,100008,100009,100010,100011,100012,100013,100014,100015,100016,100017,100018,100019,100020,100021,100022,100023,100024,100025,100026,100027,100028,100029,100030,100031,100032,100033,100034,100035,100036,100037,100038,100039
100040,100041,100042,100043,100044,100045,100046,100047,100048,100049,100050,100051,100052,100053,100054,100055,100056,100057,100058,100059,100060,100061,100062,100063,100064,100065,100066,100067,100068,100069,100070,100071,100072,100073,100074,100075,100076,100077,100078,100079
100080,100081,100082,100083,100084,100085,100086,100087,100088,100089,100090,100091,100092,100093,100094,100095,100096,100097,100098,100099,100100
The formula for the -W value is (col-1)*len(delimiter) + col, where col is the number of columns required.
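The formula can be computed in the shell rather than by hand; this is a sketch, demonstrated on generated numbers rather than the real list.dat:

```shell
# Minimum page width for 20 comma-joined columns: (cols-1)*len(delim) + cols.
cols=20
delim=,
width=$(( (cols - 1) * ${#delim} + cols ))
seq 40 | pr -J -W"$width" -"$cols"ats"$delim"
```

For 20 columns with a one-character delimiter this yields -W39, which is exactly enough to silence the page-width check while -J leaves the long lines untruncated.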
From man pr
pr - convert text files for printing
-a, --across
print columns across rather than down, used together with -COLUMN
-t, --omit-header
omit page headers and trailers; implied if PAGE_LENGTH <= 10
-s[CHAR], --separator[=CHAR]
separate columns by a single character, default for CHAR is the <TAB> character without -w and 'no
char' with -w. -s[CHAR] turns off line truncation of all 3 column options (-COLUMN|-a -COLUMN|-m)
except -w is set
-COLUMN, --columns=COLUMN
output COLUMN columns and print columns down, unless -a is used. Balance number of lines in the columns
on each page
-J, --join-lines
merge full lines, turns off -W line truncation, no column alignment, --sep-string[=STRING] sets separa‐
tors
-W, --page-width=PAGE_WIDTH
set page width to PAGE_WIDTH (72) characters always, truncate lines, except -J option is set, no inter‐
ference with -S or -s
See also Why is using a shell loop to process text considered bad practice?
If you don't wish to use any other external binaries, you can refer to the SO link below, which answers a similar question in depth.
bash: combine five lines of input to each line of output
If you want to use sed:
sed -n '21~20 { x; s/^\n//; s/\n/, /g; p; }; 21~20! H; $ { x; s/^\n//; s/\n/, /g; p; }' list.dat
The first command
21~20 { x; s/^\n//; s/\n/, /g; p;},
is triggered at lines matching 21+(n*20), n>=0. Here, everything that was put into the hold space on the complementary lines by the second command:
21~20! H;
is processed:
x;
puts the content of the hold buffer (20 lines) in the pattern space and places the current line (21+(n*20)) in the hold buffer. In the pattern space:
s/^\n//
removes the leading newline and:
s/\n/, /g
performs the desired substitution. Finally:
p;
prints the now 20-columned row.
After that the next line is left in the hold buffer and the process continues. Note that whatever is still in the hold buffer when the input ends would never be printed, so a final $ { x; s/^\n//; s/\n/, /g; p; } block is needed to flush the last group.
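paste expresses the same grouping even more compactly; a sketch, not part of the original answer, with one "-" operand per output column (so 20 dashes here), demonstrated on generated numbers:

```shell
# paste reads one input line per "-" per output row and joins them with commas.
# Replace `seq 40` with the real input, e.g. `paste -d, ... < list.dat`.
seq 40 | paste -d, - - - - - - - - - - - - - - - - - - - -
```

In bash the 20 dashes can also be generated instead of typed, e.g. paste -d, $(printf ' -%.0s' {1..20}) < list.dat.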
Related
Separate a range into single lines, adding the values in between
I have a text list with values in a range as follows:
TSNN-00500--00503
TSNN-00510--00515
But I need to separate them into single lines in a text file and add the values in between. Can this be done easily with a script? I want the result in a new text file as follows:
TSNN-00500
TSNN-00501
TSNN-00502
TSNN-00503
TSNN-00510
TSNN-00511
TSNN-00512
TSNN-00513
TSNN-00514
TSNN-00515
"Easy" is subjective. Getting a sequence from a range can be performed with seq:
$ seq -f%05.0f 00500 00503
00500
00501
00502
00503
Getting the range from the string you provide can be performed with bash find-and-replace:
$ text='TSNN-00500--00503'
$ echo ${text//[^0-9]/ }
00500 00503
Prepending a string to a line can be performed with sed:
$ echo 00502 | sed 's/^/TSNN-/'
TSNN-00502
Wrapping everything in a loop gives:
$ for i in TSNN-00500--00503 TSNN-00510--00515 ; do seq -f%05.0f ${i//[^0-9]/ } | sed 's/^/TSNN-/' ; done
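If GNU seq is available (an assumption; POSIX does not specify -f), the sed step can be folded into seq's format string, since -f accepts literal text around the conversion:

```shell
# Emit the TSNN- prefix directly from seq's format string.
for i in TSNN-00500--00503 TSNN-00510--00515; do
    seq -f 'TSNN-%05.0f' ${i//[^0-9]/ }
done
```

This removes one process per loop iteration.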
Return the number of lines required for paragraphs of text for a given width in Bash
Given a width I am trying to calculate how many lines a block of text, that contains paragraphs (\n line endings), would take. I cannot simply divide the number of characters by the width because the line endings create new lines early. I cannot count the line endings only because some paragraphs will wrap. I think that I need to loop through the paragraphs, dividing the characters by the width for each and adding the results together.
count_lines() {
  TEXT="$(echo -e $1)"
  WIDTH=$2
  LINES=0
  for i in "${TEXT[@]}"
  do
    PAR=$(echo -e "$i" | wc -c)
    LINES=$LINES + (( $PAR / $WIDTH ))
  done
  RETURN $LINES
}
Reading the text as an array did not work.
count_lines() {
  fmt -w "$2" <<<"$1" | wc -l
}
fmt is a longstanding UNIX tool (present as far back as Plan 9, and part of coreutils on GNU systems) which wraps text to a desired width. <<< is herestring syntax, a ksh and bash alternative to heredocs which allows their use without splitting a script into multiple lines.
Testing this:
text=$(cat <<'EOF'
This is a sample document with multiple paragraphs. This paragraph is the first one.

This is the second paragraph of the sample document.
EOF
)
count_lines "$text" 20
...returns output of 10. That's correct, because the text wraps as follows (line numbers at the beginning added for readability):
 1  This is a
 2  sample document
 3  with multiple
 4  paragraphs. This
 5  paragraph is the
 6  first one.
 7
 8  This is the second
 9  paragraph of the
10  sample document.
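A quick way to sanity-check the function is to run it over the same text at a few widths; this is a usage sketch with made-up sample text:

```shell
count_lines() {
    fmt -w "$2" <<<"$1" | wc -l
}

text='one two three four five six'
# Narrow widths need more lines; at width 80 everything fits on one line.
for w in 10 20 80; do
    echo "width $w: $(count_lines "$text" "$w")"
done
```

The counts should shrink (or stay equal) as the width grows.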
Sed: Better way to address the n-th line where n are elements of an array
We know that the sed command loops over each line of a file and, for each line, loops over the given command list and does something. But when the file is extremely large, the time and resource cost of this repetition may be terrible.
Suppose that I have an array of line numbers which I want to use as addresses to delete or print with sed (e.g. A=(20000 30000 50000 90000)) and there is a VERY LARGE file. The easiest way may be (remark by @John1024: careful about the line number changes for each loop):
( for NL in ${A[@]}; do sed "$NL d" $very_large_file; done; ) > .temp_file
cp .temp_file $very_large_file
rm .temp_file
The problem with the code above is that, for each line number in the array, it needs to loop over the whole file.
To avoid this, one can:
#COMM=`echo "${A[@]}" | sed 's/\s/d;/g;s/$/d/'`
#sed -i "$COMM" $very_large_file
# Edited: better with direct parameter expansion:
sed -i "${A[*]/%/d;}" $very_large_file
It first prints the array and replaces each SPACE and the END_OF_LINE with the d command of sed, so that the string looks like "20000d;30000d;50000d;90000d"; the second line then treats this string as the command list of sed. The result is that this code loops over the file only once.
Moreover, for the in-place operation (argument -i), one cannot quit early using q with sed even though the greatest line number of interest has passed, because if so, the lines after that line (e.g. 90001+) would disappear (it seems that the in-place operation just overwrites the file with stdout).
Better ideas?
(Reply to @user unknown:) I think it could be even more efficient if we managed to "quit" the loop once all indexed lines have passed. We can't, using sed -i, for the aforementioned reasons. Printing each line to a file costs more time than copying a file (compare cat file1 > file2 with cp file1 file2). We may benefit from this concept, using other methods or tools. This is what I expect.
PS: The points of this question are "line location" and "efficiency"; the "delete lines" operation is just an example. Real tasks involve much more: appending/inserting/substituting, field separation, case judgements followed by reading from or writing to files, calculations, etc. In other words, it may invoke all kinds of operations, creating sub-shells or not, caring about variable passing... So the tools used should allow line processing, and the problem is how to get to the lines of interest and perform all kinds of operations there. Any comments are appreciated.
First make a copy to a testfile for checking the results. You want to sort the line numbers, highest first:
echo "${a[@]}" | sed 's/\s/\n/g' | sort -rn
You can feed commands into ed using printf:
printf "%s\n" "command1" "command2" w q | ed -s testfile
Combine these:
printf "%s\n" $(echo "${a[@]}" | sed 's/\s/\n/g' | sort -rn | sed 's/$/d/') w q | ed -s testfile
Edit (tx @Ed_Morton): This can be written in fewer steps with
printf "%s\n" $(printf '%sd\n' "${a[@]}" | sort -rn ) w q | ed -s testfile
I cannot remove the sort, because after each deletion the remaining lines are renumbered from 1; deleting from the highest line number down avoids that.
I tried to find a command for editing the file without redirecting to another, but I started with the remark that you should make a copy. I have no choice; I have to upvote the straightforward awk solution that doesn't need a sort.
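Before touching a real file, it helps to look at the script that actually reaches ed; a sketch with hypothetical line numbers:

```shell
a=(3 7 8)
# Deletions come highest-first so earlier deletions don't renumber later targets;
# `w` writes the file and `q` quits ed.
printf "%s\n" $(printf '%sd\n' "${a[@]}" | sort -rn) w q
```

Piping this output into ed -s testfile applies the deletions in place.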
sed is for doing s/old/new, that is all, and when you add a shell loop to the mix you've really gone off the rails (see https://unix.stackexchange.com/q/169716/133219).
To delete lines whose numbers are stored in an array (using seq to generate input since no sample input/output was provided in the question):
$ a=( 3 7 8 )
$ seq 10 | awk -v a="${a[*]}" 'BEGIN{split(a,tmp); for (i in tmp) nrs[tmp[i]]} !(NR in nrs)'
1
2
4
5
6
9
10
If you wanted to stop processing with awk once the last target line has been deleted, and let tail finish the job, you could figure out the max value in the array up front and then run awk on just the part up to that last target line:
max=$( printf '%s\n' "${a[@]}" | sort -rn | head -1 )
head -"$max" file | awk '...' > out
tail -n +"$((max+1))" file >> out
idk if that'd really be any faster than just letting awk process the whole file since awk is very efficient, especially when you're not referencing any fields and so it doesn't do any field splitting, but you could give it a try.
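One way the two pieces might be combined end to end; a sketch with hypothetical array values and sample input, reusing the deletion filter above as the awk body:

```shell
a=( 3 7 8 )
seq 10 > file                       # sample input
max=$( printf '%s\n' "${a[@]}" | sort -rn | head -1 )
# Filter only the first $max lines through awk, then append the rest verbatim.
head -"$max" file |
    awk -v a="${a[*]}" 'BEGIN{split(a,tmp); for (i in tmp) nrs[tmp[i]]} !(NR in nrs)' > out
tail -n +"$((max+1))" file >> out
cat out
```

The result matches running the awk filter over the whole file.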
You could generate an intermediate sed command file from your lines:
echo ${A[@]} | tr ' ' '\n' | sort -n > lines_to_delete
min=$(head -1 lines_to_delete)
max=$(tail -1 lines_to_delete)
linecount=$(wc -l < lines_to_delete)
# skip to first and from last line, delete the others
sed -i -e 1d -e "${linecount}d" -e 's#$#d#' lines_to_delete
head -${min} input > output
sed -f lines_to_delete input >> output
tail -${max} input >> output
mv output input
Bash print array element characters vertically side by side
I have an array that I want to print vertically but also side by side. E.g. I have an array with these elements as follows, separated by spaces and with each character in an element separated by commas:
0,1,2 3,4,5 6,7,8
I want it to output:
036
147
258
Any help is greatly appreciated!
ary=(0,1,2 3,4,5 6,7,8)
pr -T -"${#ary[@]}" < <(IFS=,; echo "${ary[*]}" | tr , '\n') | tr -d '[:blank:]'
prints
036
147
258
Notes:
the < <(...) syntax is a redirection (first <) of a process substitution
the bit inside the process substitution prints the array elements joined with a comma, then translates the commas into newlines
the output of the process substitution (a single column of digits) is redirected into pr. pr is a handy tool for forcing a column of output into columns.
the -"${#ary[@]}" option tells pr to use the same number of columns as there are array elements.
the output of pr is sent to a second tr that deletes any horizontal whitespace.
If you want commas, change the second tr to:
tr -s '[:blank:]' ,
or use this:
pr -T -s, -"${#ary[@]}" < <(IFS=,; echo "${ary[*]}" | tr , '\n')
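The same transpose can also be done by awk alone; a sketch, not part of the original answer, assuming every element has the same number of comma-separated fields:

```shell
ary=(0,1,2 3,4,5 6,7,8)
# Append field i of every element to output row i, then print the rows.
printf '%s\n' "${ary[@]}" |
    awk -F, '{ for (i = 1; i <= NF; i++) row[i] = row[i] $i }
             END { for (i = 1; i <= NF; i++) print row[i] }'
```

This avoids the pr/tr round trip at the cost of holding the rows in memory.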
Shell script cut the beginning and the end of a file
So I have a file and I'd like to cut the first 33 lines and the last 6 lines of it. What I am trying to do is get the whole file with a cat command (cat file) and then use the head and tail commands to remove those parts, but I don't know how. E.g. (this is just the idea):
cat file - head -n 33 file - tail -n 6 file
How am I supposed to do this? Is it possible with sed (and how)? Thanks in advance.
This is probably what you want:
$ tail -n +34 file | head -n -6
See the tail and head man pages:
-n, --lines=K
       output the last K lines, instead of the last 10; or use -n +K to output lines starting with the Kth
-n, --lines=[-]K
       print the first K lines instead of the first 10; with the leading '-', print all but the last K lines of each file
Example:
$ cat file
one
two
three
four
five
six
seven
eight
$ tail -n +4 file | head -n -2
four
five
six
Notice that you don't need the cat (see UUOC).
First count the total lines, then print the middle part (reads the file twice):
l=$(wc -l < file)
awk -v l="$l" 'NR>33 && NR<=l-6' file
Or load the file into an array, then print the lines you need (reads the file once):
awk '{a[NR]=$0} END{for (i=34; i<=NR-6; i++) print a[i]}' file
Or combine awk with head; don't overthink it (reads the file twice):
awk 'NR>33' file | head -n -6
This will work too:
sed -n '1,33b;34{N;N;N;N;N};N;P;D' file
This might work for you (GNU sed): sed '1,33d;:a;$d;N;s/\n/&/6;Ta;P;D' file
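Any of these variants can be cross-checked against a numbered file where the surviving range is known; a sketch using seq (assuming GNU head and GNU sed):

```shell
seq 50 > file
# Drop the first 33 and the last 6 lines; lines 34..44 should survive.
tail -n +34 file | head -n -6 > out1
sed '1,33d;:a;$d;N;s/\n/&/6;Ta;P;D' file > out2
diff out1 out2 && echo "both variants agree"
```

Both pipelines should produce the identical 11-line result.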