How can I avoid newlines after array elements when using readarray? - arrays

I've got a text file
Today, 12:34
Today, 21:43
Today, 12:43
https://123
https://456
https://789
and wanted to read each line into an array. Therefore I used:
readarray array <'file.txt'
Now I'd like to create a new array mixing date and the corresponding link, so in this case, line 1 corresponds with line 4 and so on.
I wrote
declare -a array2
array2[0]=${array[0]}${array[3]}
array2[1]=${array[1]}${array[4]}
...
Printing the whole array2 using "echo ${array2[*]}" gives the following:
Today, 12:34
https://123
Today, 21:43
https://456
Today, 12:43
https://789
Why are there newlines between the elements, e.g. between array2[0] and array2[1]? How can I get rid of them?
And why is there a space before the T in the second and following lines?
And is there a way to write the code above as a loop?

Use the -t argument to prevent the newlines from being included in the data stored in the individual array elements:
readarray -t array <file.txt
BTW, you can always strip your newlines after-the-fact, even if you don't prevent them from being read in the first place, by using the ${var%suffix} parameter expansion with the $'\n' syntax to refer to a newline literal:
array2[0]=${array[0]%$'\n'}${array[3]%$'\n'}
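To answer your last question: yes, this can be written as a loop. A minimal sketch, assuming the file always holds all the dates in its first half and the matching links in its second half:
readarray -t array < file.txt
half=$(( ${#array[@]} / 2 ))   # number of date/link pairs
declare -a array2
for (( i = 0; i < half; i++ )); do
    array2[i]=${array[i]}${array[i+half]}   # pair each date with its link
done
echo "${array2[*]}"
With -t in place there are no stray newlines, so the echo prints everything on one line, separated by single spaces.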

An awk solution would be much simpler (and much faster). Simply read all lines containing "Today" into an array in awk. Then, beginning with the first line that does not contain "Today", print the current line followed by the associated line from the array, e.g.
awk '/Today/{a[++n] = $0; next} {printf "%s\t%s\n", $0, a[++m]}' file.txt
Example Use/Output
With your example lines in file.txt, you would receive:
$ awk '/Today/{a[++n] = $0; next} {printf "%s\t%s\n", $0, a[++m]}' file.txt
https://123 Today, 12:34
https://456 Today, 21:43
https://789 Today, 12:43
Or if you wanted to change the order:
$ awk '/Today/{a[++n] = $0; next} {printf "%s\t%s\n", a[++m], $0}' file.txt
Today, 12:34 https://123
Today, 21:43 https://456
Today, 12:43 https://789
Addition Per-Comment
If you are seeing whitespace before the output with awk, that is due to whitespace before the first field in the input. To eliminate it, you can force awk to recalculate each field, removing the whitespace, simply by setting a field equal to itself, e.g.
awk '{$1 = $1} /Today/{a[++n] = $0; next} {printf "%s\t%s\n", a[++m], $0}' file.txt
By setting the first field equal to itself ($1 = $1), you force awk to rebuild the record, which eliminates the leading whitespace. Take for example your data with leading whitespace (each line is preceded by three spaces):
   Today, 12:34
   Today, 21:43
   Today, 12:43
   https://123
   https://456
   https://789
Using the updated command gives the answers shown above with the whitespace removed.
Using paste
You can use the paste command as another option, along with wc -l (word count lines). Simply determine the number of lines, then use process substitution to output the first half of the lines followed by the last half and combine them with paste, e.g.
$ lc=$(wc -l <file.txt); paste <(head -n $((lc/2)) file.txt) <(tail -n $((lc/2)) file.txt)
Today, 12:34 https://123
Today, 21:43 https://456
Today, 12:43 https://789
(above, lc holds the line-count and then head and tail are used to split the file)
Let me know if you have questions or if this isn't what you were attempting to do.

Related

Bash RegEx and Storing Commands into a Variable

In Bash I have an array names that contains the string values
Dr. Praveen Hishnadas
Dr. Vij Pamy
John Smitherson,Dr.,Service Account
John Dinkleberg,Dr.,Service Account
I want to capture only the names
Praveen Hishnadas
Vij Pamy
John Smitherson
John Dinkleberg
and store them back into the original array, overwriting their unsanitized versions.
I have the following snippet of code note that I'm executing the regex in Perl (-P)
for i in "${names[@]}"
do
echo $i|grep -P '(?:Dr\.)?\w+ \w+|$' -o | head -1
done
Which yields the output
Dr. Praveen Hishnadas
Dr. Vij Pamy
John Smitherson
John Dinkleberg
Questions:
1) Am I using the look-around command ?: incorrectly? I'm trying to optionally match "Dr." while not capturing it.
2) How would I store the result of that echo back into the array names? I have tried setting it to
i=echo $i|grep -P '(?:Dr\.)?\w+ \w+|$' -o | head -1
i=$(echo $i|grep -P '(?:Dr\.)?\w+ \w+|$' -o | head -1)
i=`echo $i|grep -P '(?:Dr\.)?\w+ \w+|$' -o | head -1`
but to no avail. I only started learning bash 2 days ago and I feel like my syntax is slightly off. Any help is appreciated.
Your (?:Dr\.)? is an optional non-capturing group, not a look-around: whatever it matches is still included in the overall match, which is why "Dr." shows up in your output. You probably want a negative lookahead like (?!Dr\.)\w+ \w+. I'll throw in a leading \b anchor as a bonus.
names=('Dr. Praveen Hishnadas' 'Dr. Vij Pamy' 'John Smitherson,Dr.,Service Account' 'John Dinkleberg,Dr.,Service Account')
for i in "${names[@]}"
do
grep -P '\b(?!Dr\.)\w+ \w+' -o <<<"$i" |
head -n 1
done
It doesn't matter for the examples you provided, but you should basically always quote your variables. See When to wrap quotes around a shell variable?
Maybe also google "falsehoods programmers believe about names".
To update your array, loop over the array indices and assign back into the array.
for((i=0;i<${#names[@]};++i)); do
names[$i]=$(grep -P '\b(?!Dr\.)\w+ \w+' -o <<<"${names[i]}" | head -n 1)
done
How about something like this for the regex?
(?:^|\.\s)(\w+)\s+(\w+)
(?: # Non-capturing group
^|\.\s # Match at start of line, or right after a dot followed by a space
)
(\w+) # Group 1 captures the first name
\s+ # Match unlimited number of spaces between first and last name (take + off to match 1 space)
(\w+) # Group 2 captures surname.
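If you want to extract just the captured part in the shell, one option is GNU grep's \K escape, which discards everything matched before it from the output. A sketch, assuming GNU grep built with PCRE support (the -P flag):
$ grep -oP '(?:^|\.\s)\K\w+\s+\w+' <<<'Dr. Praveen Hishnadas'
Praveen Hishnadas
$ grep -oP '(?:^|\.\s)\K\w+\s+\w+' <<<'John Smitherson,Dr.,Service Account'
John Smitherson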

printing part of file

Is there a magic unix command for printing part of a file? I have a file that has several millions of lines and I would like to skip first million or so lines and print the next million lines of the file.
Thank you in advance.
To extract data, sed is your friend.
Assuming a 1-off task that you can enter to your cmd-line:
sed -n '200000,300000p' file | enscript
"number comma (,) number" is one form of a range command in sed. This one starts at line 200,000 and *p*rints until you get to line 300,000.
If you want the output to go to your screen remove the | enscript
enscript is a utility that manages the process of sending data to PostScript-compatible printers. My Linux distro doesn't have it, so it's not necessarily a standard utility. Hopefully you know which command you need to pipe to in order to get output printed on paper.
If you want to "print" to another file, use
sed -n '200000,300000p' file > smallerFile
IHTH
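For the sizes in the question (skip the first million lines, print the next million), it also helps to tell sed to quit once the range has been printed, so it doesn't scan the rest of the file. A sketch with assumed line numbers:
sed -n '1000001,2000000p; 2000000q' file > smallerFile
The q command stops sed at line 2,000,000, right after that line has been printed.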
I would suggest awk as it is a little easier and more flexible than sed:
awk 'FNR>12 && FNR<23' file
where FNR is the record number. So the above prints lines above 12 and below 23.
And you can make it more specific like this:
awk 'FNR<100 || FNR >990' file
which prints lines if the record number is less than 100 or over 990. Or, to print lines over 100 plus any lines containing "fred":
awk 'FNR >100 || /fred/' file
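On files with millions of lines the same early-exit idea applies to awk, since there is no reason to keep reading once the window has passed. A sketch using the question's numbers (skip the first million lines, print the next million):
awk 'FNR>2000000{exit} FNR>1000000' file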

Bash read file to an array based on two delimiters

I have a file that I need to parse into an array but I really only want a brief portion of each line and for only the first 84 lines.
Sometimes the line maybe:
>MT gi...
And I would just want the MT to be entered into the array. Other times it might be something like this:
>GL000207.1 dn...
and I would need the GL000207.1
I was thinking that you might be able to set two delimiters (one being the '>' and the other being the ' ' whitespace), but I am not sure how you would go about it. I have read other people's posts about the internal field separator, but I am really not sure how that would work. I would think perhaps something like this might work though?
desiredArray=$(echo file.whatever | tr ">" " ")
for x in $desiredArray
do
echo > $x
done
Any suggestions?
How about:
head -84 <file> | awk '{print $1}' | tr -d '>'
head takes only the first lines of the file, awk strips off the first space and everything after it, and tr gets rid of the '>'.
You can also do it with sed:
head -n 84 <file> | sed 's/>\([^ ]*\).*/\1/'
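To get the results into an actual bash array, as the question asks, you can feed either pipeline to readarray via process substitution. A sketch, assuming bash 4+ for readarray:
readarray -t desiredArray < <(head -n 84 file.whatever | awk '{print $1}' | tr -d '>')
printf '%s\n' "${desiredArray[@]}"   # one entry per line, e.g. MT, GL000207.1, ...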

script for getting extensions of a file

I need to get all the file extension types in a folder. For instance, if the directory's ls gives the following:
a.t
b.t.pg
c.bin
d.bin
e.old
f.txt
g.txt
I should get this by running the script
.t
.t.pg
.bin
.old
.txt
I have a bash shell.
Thanks a lot!
See the BashFAQ entry on ParsingLS for a description of why many of these answers are evil.
The following approach avoids this pitfall (and, by the way, completely ignores files with no extension):
shopt -s nullglob
for f in *.*; do
printf '%s\n' ".${f#*.}"
done | sort -u
Among the advantages:
Correctness: ls behaves inconsistently and can mangle filenames in its output. See the link at the top.
Efficiency: Minimizes the number of subprocesses invoked (only one, sort -u, and even that could be removed if we used Bash 4's associative arrays to store results)
Things that still could be improved:
Correctness: this will correctly handle newlines in filenames before the first . (which some other answers won't) -- but filenames with newlines after the first . will be treated as separate entries by sort. This could be fixed by using nulls as the delimiter, or by the aforementioned bash 4 associative-array storage approach.
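For reference, here is a sketch of that associative-array variant (assuming bash 4+); it drops the sort -u subprocess and deduplicates correctly even with newlines in filenames, at the cost of unspecified output order:
shopt -s nullglob
declare -A extensions
for f in *.*; do
    extensions[".${f#*.}"]=1   # the key is the full extension, e.g. ".t.pg"
done
printf '%s\n' "${!extensions[@]}"
Pipe the result through sort if you want a stable order.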
try this:
ls -1 | sed 's/^[^.]*\(\..*\)$/\1/' | sort -u
ls lists files in your folder, one file per line
sed magic extracts extensions
sort -u sorts extensions and removes duplicates
sed magic reads as:
s/ / /: substitutes whatever is between first and second / by whatever is between second and third /
^: match beginning of line
[^.]: match any character that is not a dot
*: match it as many times as possible
\( and \): remember whatever is matched between these two parentheses
\.: match a dot
.: match any character
*: match it as many times as possible
$: match end of line
\1: this is what has been matched between parentheses
People are really over-complicating this - particularly the regex:
ls | grep -o "\..*" | uniq
ls - get all the files
grep -o "\..*" - -o only show the match; "\..*" match at the first "." & everything after it
uniq - don't print adjacent duplicates but keep the same order
You can also sort if you like, but sorting doesn't match the example
This is what happens when you run it:
> ls -1
a.t
a.t.pg
c.bin
d.bin
e.old
f.txt
g.txt
> ls | grep -o "\..*" | uniq
.t
.t.pg
.bin
.old
.txt

How can I make 'grep' show a single line five lines above the grepped line?

I've seen some examples of grepping lines before and after, but I'd like to ignore the middle lines.
So, I'd like the line five lines before, but nothing else.
Can this be done?
OK, I think this will do what you're looking for. It will look for a pattern, and extract the 5th line before each match.
grep -B5 "pattern" filename | awk -F '\n' 'ln ~ /^$/ { ln = "matched"; print $1 } $1 ~ /^--$/ { ln = "" }'
Basically, how this works is: it takes the first line, prints it, and then waits until it sees ^--$ (the match separator used by grep), and starts again.
If you only want to have the 5th line before the match you can do this:
grep -B 5 pattern file | head -1
Edit:
If you can have more than one match, you could try this (exchange pattern with your actual pattern):
sed -n '/pattern/!{H;x;s/^.*\n\(.*\n.*\n.*\n.*\n.*\)$/\1/;x};/pattern/{x;s/^\([^\n]*\).*$/\1/;p}' file
I took this from a Sed tutorial, section: Keeping more than one line in the hold buffer, example 2 and adapted it a bit.
This is option -B
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing -- between contiguous groups of
matches.
This way is easier for me:
grep --no-group-separator -B5 "pattern" file | sed -n '1~6p'
This greps the five lines before each match plus the match itself (so each group is six lines), turns off the -- group separator, then prints the first line of each six-line group.
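Another route is to drop grep entirely and keep a five-line rolling buffer in awk. At each line, buf[NR%5] still holds the line read five records earlier, so print it before overwriting (a sketch; replace pattern with your own):
awk '/pattern/ && NR>5 {print buf[NR%5]} {buf[NR%5]=$0}' file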
