Bash read file to an array based on two delimiters - arrays

I have a file that I need to parse into an array but I really only want a brief portion of each line and for only the first 84 lines.
Sometimes the line may be:
>MT gi...
And I would just want the MT to be entered into the array. Other times it might be something like this:
>GL000207.1 dn...
and I would need the GL000207.1
I was thinking that you might be able to set two delimiters (one being the '>' and the other being the ' ' whitespace), but I am not sure how to go about it. I have read other people's posts about the internal field separator, but I am really not sure how that would work. Perhaps something like this might work?
desiredArray=$(echo file.whatever | tr ">" " ")
for x in $desiredArray
do
echo > $x
done
Any suggestions?

How about:
head -84 <file> | awk '{print $1}' | tr -d '>'
head takes only the first 84 lines of the file, awk prints the first field (dropping the first space and everything after it), and tr gets rid of the '>'.

You can also do it with sed:
head -n 84 <file> | sed 's/>\([^ ]*\).*/\1/'
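Neither pipeline above stores the result in an array yet. A minimal sketch of one way to do that, assuming bash 4 or later (for mapfile) and using a temporary sample file that is made up here purely for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical sample input; the two lines are the truncated examples from the question.
sample=$(mktemp)
printf '%s\n' '>MT gi...' '>GL000207.1 dn...' > "$sample"

# Treat both '>' and ' ' as field separators. The text before '>' is the
# empty field $1, so the name we want is $2.
mapfile -t names < <(head -n 84 "$sample" | awk -F'[> ]' '{ print $2 }')

# Quote the [@] expansion so each element is printed intact.
printf '%s\n' "${names[@]}"
rm -f "$sample"
```

The bracket expression in -F is what gives the "two delimiters" behaviour asked about; mapfile -t then puts one line into each array element without relying on IFS.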

Related

Parsing HTML to array only returns one word

I'm trying to parse some HTML subtitles into an array using Bash and html-xml-utils, and I've tried using a Lynx dump to pretty it up, but I had the same problem, because I can't get my sed to put more than one word at a time into the array.
Code:
array=($(echo $PAGE |
hxselect -i ".sub_info_container .sub_title" |
sed -r 's/.*\">(.*)<\/a>.*/\1/' ))
echo $array
This gets piped into sed:
<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>
Output of echo $array:
Some
What I'm trying to get:
Some Random Title
Without the punctuation would be nice, and the subtitles often have ? or ! instead of period, but it could work including punctuation too.
Things I've tried:
Using Lynx to pretty up the code, then using awk to grab the elements
A lot of different sed and awk methods of grabbing the text
I'm not sure why, but my code ended up separating spaces into separate items. The solution was the following code:
array=($(echo $PAGE |
hxselect -i ".sub_info_container .sub_title" |
lynx -stdin -dump | tr " " - ))
I used tr to turn the spaces into dashes, allowing each title to be passed into the array as a single element. Taking off the extra parentheses, as everybody suggested, actually removed the array assignment, which I had stated was my intention. After the code completed, I simply converted all the dashes back to spaces. It's not pretty, but it works!
Try this:
s='<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>'
array=$(echo "$s" | sed 's/<\/div><div /\n/' | sed -r 's/.*\">(.*)<\/a>.*/\1/g')
echo "$array"
I had to add a newline between the divs to match both. I'm not that good with sed and couldn't figure out how to do it without that.
Your main problem was with the extra parentheses:
array=($(echo .....))
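For what it is worth, the outer parentheses are what make the assignment an array. The titles get broken up because the unquoted $(...) is split on every space, and echo $array only ever shows element 0. A rough sketch of one alternative, assuming bash 4+ and GNU or BSD grep, and using grep -o on the sample string instead of hxselect so that it runs on its own (regex matching on HTML is fragile and only meant for simple markup like this):

```shell
#!/usr/bin/env bash
s='<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>'

# grep -o prints each <a ...>...</a> match on its own line;
# sed then strips the tags, leaving only the link text.
mapfile -t titles < <(printf '%s\n' "$s" | grep -o '<a [^>]*>[^<]*</a>' | sed -e 's/<[^>]*>//g')

# mapfile stores one line per element, so spaces inside a title are kept.
printf '%s\n' "${titles[@]}"
```

Printing with "${titles[@]}" rather than $titles shows every element, which avoids the impression that only one word was stored.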

Having issues using IFS to cut a string into an array. BASH

I have tried everything I can think of to cut this into separate elements for my array but I am struggling..
Here is what I am trying to do..
(This command just rips out the IP addresses on the first element returned )
$ IFS=$"\n"
$ aaa=( $(netstat -nr | grep -v '^0.0.0.0' | grep -v 'eth' | grep "UGH" | sed 's/ .*//') )
$ echo "${#aaa[@]}"
1
$ echo "${aaa[0]}"
4.4.4.4
5.5.5.5
This shows more than one value when I am looking for the array to separate 4.4.4.4 into ${aaa[0]} and 5.5.5.5 into ${aaa[1]}
I have tried:
IFS="\n"
IFS=$"\n"
IFS=" "
I am very confused, as I have been working with arrays a lot recently and have never run into this particular issue.
Can someone tell me what I am doing wrong?
There is a very good example of how to use IFS + read -a to split a string into an array on this other Stack Overflow page:
How does splitting string to array by 'read' with IFS word separator in bash generated extra space element?
netstat is deprecated and has been replaced by ss, so I'm not sure how to reproduce your exact problem.
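A likely cause, independent of netstat: IFS="\n" and IFS=$"\n" both set IFS to the two characters backslash and n, not to a newline, so output containing neither character is never split. A newline needs ANSI-C quoting, $'\n'. The sketch below assumes bash 4+ and uses printf as a stand-in for the netstat pipeline:

```shell
#!/usr/bin/env bash
# Stand-in for the pipeline's output: two addresses, one per line.
output=$(printf '%s\n' 4.4.4.4 5.5.5.5)

# Option 1: mapfile reads one line per element and ignores IFS entirely.
mapfile -t aaa <<< "$output"
echo "${#aaa[@]}"
echo "${aaa[0]}"

# Option 2: keep the original array syntax, but set IFS to a real newline
# with $'\n' and restore it afterwards.
old_ifs=$IFS
IFS=$'\n'
bbb=( $output )
IFS=$old_ifs
echo "${#bbb[@]}"
```

Option 1 is usually the safer choice, since the unquoted expansion in option 2 is also subject to filename globbing.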

linux bash cut first word from each line of a file, assign it to an array and remove duplicates

I believe my title explains what I am trying to do. Right now I am cutting and echoing out the first word of each line, which is working; all I need now is to remove duplicates. The reason I want to assign the words to an array is so I can combine all the elements into a single string of comma-separated values that I can place in a new file. Maybe there is an easier way to achieve this. I am new to bash scripting, so I appreciate any help.
Thanks.
here is my code so far
#!/bin/bash
cut -d' ' -f1 $1
cut -d' ' -f1 $1 | sort | uniq | paste -sd,
An awk one-liner can do all of it
awk '!a[$1]{} END{for (i in a) print i}' file > output
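This awk command builds an array a keyed by the first word of each line. Merely referencing a[$1] creates the key if it does not already exist, so each word is stored only once. Finally, the list of unique words is printed in the END section, in no guaranteed order.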
PS: If order of words is important (as per their appearance in the file):
awk '!($1 in a){a[$1];b[++i]=$1} END{for (k=1; k<=i; k++) print b[k]}' file
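To get all the way to the single comma-separated string the question asks for, the unique words can be piped through paste. A short sketch, using a made-up sample file and the common seen[$1]++ idiom in place of the two-array version above:

```shell
#!/usr/bin/env bash
# Hypothetical sample input: the first word of each line is what we care about.
sample=$(mktemp)
printf '%s\n' 'apple red' 'pear green' 'apple green' 'fig purple' > "$sample"

# seen[$1]++ evaluates to 0 the first time a word appears, so the negated
# test is true only for first occurrences, and file order is preserved.
csv=$(awk '!seen[$1]++ { print $1 }' "$sample" | paste -sd, -)

echo "$csv"
rm -f "$sample"
```

Unlike sort | uniq, this keeps the words in their order of first appearance rather than sorting them.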

variable still empty after the loop problem in shell scripting

I'm very new to shell scripting, and I encountered a problem that is quite weird. The program is rather simple, so I'll just post it here:
#!/bin/bash
list=""
list=`mtx -f /dev/sg2 status | while read status
do
result=$(echo ${status} | grep "Full")
if [ -z "$result" ]; then
continue
else
echo $(echo ${result} | cut -f3 -d' ' | tr -d [:alpha:] | tr -d [:punct:])
fi
done`
echo ${list}
for haha in ${list}
do
printf "current slot is:%s \n" ${haha}
done
What the program does is that it executes mtx -f /dev/sg2 status and goes to each line and see if there's a full disk. If it has "Full" in that line, I'll record the slot number in that line, and put in the list.
Notice that I put a back quote after list= at line 6, and it covers the whole "while" loop after that. The reason is unclear to me, but I got this usage by just googling it. It is said that the while loop will open up a separate shell or something like that, so when the while loop is done, whatever you concatenated in the loop will get lost, so in my initial implementation, list is still empty after the while loop.
My question is: even if the code above works fine, it looks pretty tricky to others, and what's worse, I can make only ONE list after the loop is done. Is there a better way to fix this so that I can pull more information out of the loop? For example, what if I need list2 to store other values? Thanks.
Your shell script does work. If you wanted to get two pieces of info per line instead of one, you would have to change this line
echo $(echo ${result} | cut -f3 -d' ' | tr -d [:alpha:] | tr -d [:punct:])
to concatenate the desired values, separated by a comma or any other "special" character. Then you could parse your list this way:
for haha in ${list}
do
printf "current slot is:%s, secondary info:%s \n" $(echo ${haha} | cut -f1 -d',') $(echo ${haha} | cut -f2 -d',')
done
See this explanation. Because a pipe is involved, the while read... code isn't executed in your current shell but in a subshell (a child process that can't update the variables of your current shell).
Choose one of the listed workarounds to make the while read... loop execute in your current shell.
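One such workaround in bash is process substitution, which feeds the loop from a command without putting the loop itself in a pipeline, so any number of variables or arrays survive it. A minimal sketch follows; fake_status and its line format are invented stand-ins for mtx -f /dev/sg2 status, whose real output will look different and need different parsing:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `mtx -f /dev/sg2 status`; the line format is invented.
fake_status() {
  printf '%s\n' 'slot 1 Full tagA' 'slot 2 Empty' 'slot 3 Full tagB'
}

slots=()
tags=()
# The loop runs in the current shell because its input comes from
# < <(...) rather than from a pipe, so both arrays keep their contents.
while read -r _ slot state tag; do
  [[ $state == Full ]] || continue
  slots+=("$slot")
  tags+=("$tag")
done < <(fake_status)

for i in "${!slots[@]}"; do
  printf 'current slot is:%s, secondary info:%s\n' "${slots[i]}" "${tags[i]}"
done
```

Two parallel arrays also avoid having to pack both values into one comma-separated word and cut them apart again later.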

How can I make 'grep' show a single line five lines above the grepped line?

I've seen some examples of grepping lines before and after, but I'd like to ignore the middle lines.
So, I'd like the line five lines before, but nothing else.
Can this be done?
OK, I think this will do what you're looking for. It will look for a pattern, and extract the 5th line before each match.
grep -B5 "pattern" filename | awk -F '\n' 'ln ~ /^$/ { ln = "matched"; print $1 } $1 ~ /^--$/ { ln = "" }'
Basically, it prints the first line of grep's output, then skips lines until it sees ^--$ (the group separator used by grep), and then starts again with the next group.
If you only want to have the 5th line before the match you can do this:
grep -B 5 pattern file | head -1
Edit:
If you can have more than one match, you could try this (exchange pattern with your actual pattern):
sed -n '/pattern/!{H;x;s/^.*\n\(.*\n.*\n.*\n.*\n.*\)$/\1/;x};/pattern/{x;s/^\([^\n]*\).*$/\1/;p}' file
I took this from a Sed tutorial, section: Keeping more than one line in the hold buffer, example 2 and adapted it a bit.
This is option -B
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing -- between contiguous groups of
matches.
This way is easier for me:
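grep --no-group-separator -B5 "pattern" file | sed -n 1~6p
This greps the 5 lines before each match plus the matching line itself (6 lines per group), turns off the -- group separator, then prints the first line of every group of 6. It assumes GNU grep and GNU sed, that the context groups do not overlap, and that each match has at least 5 lines before it.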
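The grep -B pipelines above depend on the context groups not overlapping. A possible alternative that does not is to keep the last 5 lines in a small ring buffer in awk. This is only a sketch, with a made-up sample file and the literal pattern HIT standing in for the real one:

```shell
#!/usr/bin/env bash
# Hypothetical sample input: matches on lines 7 and 9, so their contexts overlap.
sample=$(mktemp)
printf '%s\n' a1 a2 a3 a4 a5 a6 HIT7 a8 HIT9 a10 > "$sample"

# buf holds the previous n lines, indexed by line number modulo n.
# On a match, slot NR % n still contains line NR - n, because the
# current line is stored only after the check.
result=$(awk -v n=5 '/HIT/ && NR > n { print buf[NR % n] } { buf[NR % n] = $0 }' "$sample")

echo "$result"
rm -f "$sample"
```

Matches within the first 5 lines of the file are skipped silently, since there is no fifth line before them.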
