Shell Script regex matches to array and process each array element

While I've handled this task easily in other languages, I'm at a loss for which commands to use when shell scripting (CentOS/Bash).
I have a regex that produces many matches in a file I've read into a variable, and I would like to put the matches into an array so I can loop over it and process each entry.
I typically use https://regexr.com/ to form my capture groups, then pass the regex to JS/Python/Go to get an array and loop over it. In shell scripting, I'm not sure what I can use.
So far I've played with "sed" to find all matches and replace, but don't know if it's capable of returning an array to loop from matches.
Take regex, run on file, get array back. I would love some help with Shell Scripting for this task.
EDIT:
Based on comments, put this together (not working via shellcheck.net):
#!/bin/sh
examplefile="
asset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')
asset('3a/3b/3c.ext')
"
examplearr=($(sed 'asset\((.*)\)' $examplefile))
for el in ${!examplearr[*]}
do
echo "${examplearr[$el]}"
done

This works in bash on a mac:
#!/bin/bash
examplefile="
asset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')
asset('3a/3b/3c.ext')
"
examplearr=(`echo "$examplefile" | sed -e '/.*/s/asset(\(.*\))/\1/'`)
for el in ${examplearr[*]}; do
echo "$el"
done
output:
'1a/1b/1c.ext'
'2a/2b/2c.ext'
'3a/3b/3c.ext'
Note the wrapping of $examplefile in quotes, and the use of sed to replace the entire line with the match. If there will be other content in the file, either on the same lines as the "asset" string or in other lines with no assets at all you can refine it like this:
#!/bin/bash
examplefile="
fooasset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')bar
foobar
fooasset('3a/3b/3c.ext')bar
"
examplearr=(`echo "$examplefile" | grep asset | sed -e '/.*/s/^.*asset(\(.*\)).*$/\1/'`)
for el in ${examplearr[*]}; do
echo "$el"
done
and achieve the same result.

There are several ways to do this. I'd do with GNU grep with perl-compatible regex (ah, delightful line noise):
mapfile -t examplearr < <(grep -oP '(?<=[(]).*?(?=[)])' <<<"$examplefile")
for i in "${!examplearr[@]}"; do printf "%d\t%s\n" "$i" "${examplearr[i]}"; done
0 '1a/1b/1c.ext'
1 '2a/2b/2c.ext'
2 '3a/3b/3c.ext'
This uses the bash mapfile command to read lines from stdin and assign them to an array.
The bits you're missing from the sed command:
$examplefile is text, not a filename, so you have to send it to sed's stdin
sed's a funny little language with 1-character commands: you've given it the "a" command, which is inappropriate in this case.
you only want to output the captured parts of the matches, not every line, so you need the -n option, and you need to print somewhere: the p flag in s///p means "print the [line] if a substitution was made".
sed -n 's/asset\(([^)]*)\)/\1/p' <<<"$examplefile"
# or
echo "$examplefile" | sed -n 's/asset\(([^)]*)\)/\1/p'
Note that this returns values like ('1a/1b/1c.ext') -- with the parentheses. If you don't want them, add the -r or -E option to sed: among other things, that flips the meaning of ( and \(
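As a minimal sketch of the points above put together: this assumes bash 4 or later (for mapfile) and a sed that accepts -E, as GNU and BSD sed do. The array name assets is only illustrative.

```shell
#!/usr/bin/env bash
# Sample text standing in for the file contents (same data as in the question).
examplefile="
asset('1a/1b/1c.ext')
asset('2a/2b/2c.ext')
asset('3a/3b/3c.ext')
"

# -n suppresses the default output; -E makes ( ) group and \( \) match literal parentheses.
# The trailing p prints only the lines on which the substitution succeeded.
mapfile -t assets < <(sed -nE 's/.*asset\(([^)]*)\).*/\1/p' <<<"$examplefile")

# Quote the expansion so each element stays intact.
for a in "${assets[@]}"; do
    printf '%s\n' "$a"
done
```

Each element still carries the single quotes from the source text; they are part of the captured string, not shell quoting.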

Related

Creating an array of Strings from Grep Command

I'm pretty new to Linux and I've been trying to learn recently. One thing I'm struggling with: within a log file, I would like to grep for all the unique IDs that exist and store them in an array.
The IDs have the format id=12345678,
I'm struggling to get these into an array, though. So far I've tried a range of things, including the following:
a=($ (grep -HR1 `id=^[0-9]' logfile))
echo ${#a[@]}
but the echo count is always returned as 0. So it is clear the populating of the array is not working. Have explored other pages online, but nothing seems to have a clear explanation of what I am looking for exactly.
a=($(grep -Eow 'id=[0-9]+' logfile))
a=("${a[@]#id=}")
printf '%s\n' "${a[@]}"
It's safe to split an unquoted command substitution here, as we aren't printing pathname expansion characters (*?[]), or whitespace (other than the new lines which delimit the list).
If this were not the case, mapfile -t a < <(grep ...) is a good alternative.
-E is extended regex (for +)
-o prints only matching text
-w matches a whole word only
${a[@]#id=} strips the id= prefix from each array element
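A small self-contained sketch of the same idea, using mapfile instead of word splitting. The log lines are invented for illustration, and sort -u is an addition here, on the assumption that "unique IDs" means duplicates should be dropped.

```shell
#!/usr/bin/env bash
# Hypothetical log content; the real logfile is assumed to contain id=<digits> tokens.
logfile=$(mktemp)
printf '%s\n' 'ts=1 id=12345678, ok' 'ts=2 id=87654321, fail' 'ts=3 id=12345678, ok' > "$logfile"

# -E enables +, -o prints each match on its own line; sort -u keeps one copy of each ID.
mapfile -t ids < <(grep -Eo 'id=[0-9]+' "$logfile" | sort -u)

# ${var#pattern} removes the shortest matching prefix; with [@] it applies to every element.
ids=("${ids[@]#id=}")

echo "${#ids[@]}"
printf '%s\n' "${ids[@]}"
rm -f "$logfile"
```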
Here is an example
my_array=()
while IFS= read -r line; do
my_array+=( "$line" )
done < <( ls )
echo "${#my_array[@]}"
printf '%s\n' "${my_array[@]}"
It prints out 14 and then the names of the 14 files in the same folder. Just substitute your command for ls and you're all set.
I suggest the readarray command, which reads whole lines into the array.
readarray -t my_array < <(grep -Eo 'id=[0-9]+' logfile)
printf "%s\n" "${my_array[@]}"

Add text on certain lines of a file, with the added text depending on the output of a command that takes a substring from the line

I'm trying to make a shell script to take an input file (thousands of lines) and produce an output file that is the same except that on certain lines there will be added text. When the text is added to the (middle of) the line, the exact added text will depend on a substring on the line. The correlation between the substring and the added text is complex and comes from an external program that I can call in the shell. I don't have the source for this converter program nor any control over how the mapping is done.
To explain further...
I have an input file of this general format:
Blah Blah Unimportant
Something Something
FIELD_INFO(field_name_1, output_1),
FIELD_INFO(field_name_2, output_2),
Yadda Yadda
The whole file needs to be copied, with added text, but the only important parts for me are the field names (e.g. field_name_1, field_name_2). I have a command line program called "converter" that can take a file of field names and output a list of corresponding actions. Converter cannot operate directly on the input file. The input to converter needs to be just field names and the output of converter has extra information I don't need:
converter_field_name_1 "action1" /* Use this action for field_name_1 */
converter_field_name_2 "action2" /* use this action for field_name_2 */
The desire is to create a second file that looks like this:
Blah Blah Unimportant
Something Something
FIELD_INFO(field_name_1, action1, output_1),
FIELD_INFO(field_name_2, action2, output_2),
Yadda Yadda
Here is the script I'm working on, but I've hit a wall (or two):
#!/bin/bash
filename="input_file"
# Let's create an array of the field names to feed to the converter program
field_array=($(sed -e '/^\s*FIELD_INFO/ s/FIELD_INFO(\(.*\),.*),/\1/' -e 't' -e 'd' < ${filename}))
# Save the array to a file, to be able to use the converter's file option
printf "%s\n" "${field_array[@]}" > script_field_names.txt
# Use converter on the whole file and extract only the actions into another array
action_array=($(converter -f script_field_names.txt | cut -d'"' -f 2))
# I will make and use an associative array and try to use
# sed to do the substitution
declare -A mapper
for i in ${!field_array[*]}
do
mapper[${field_array[i]}]=${action_array[i]}
done
#Now go back through the file and add action names (source file unchanged)
sed -e "s/FIELD_INFO(\(.*\),\(.*?),\)/FIELD_INFO(\1, ${mapper[\1], \2}/" < ${filename}
I know now that I can't use the sed group capture "\1" as an index into the mapper array like this. It is not working as a key and the output looks like this:
Blah Blah Unimportant
Something Something
FIELD_INFO(field_name_1, , output_1),
FIELD_INFO(field_name_2, , output_2),
Yadda Yadda
My actual script has debug statements scattered throughout and I know the field array, action array, and mapper array are all getting created correctly. But my idea of using the group capture substring from sed as the index into the mapper array is not working because I now know that sed expands the variables before running in the sub-shell, so the mapper[] array is not seeing the substring as an index.
What should I be doing instead? This script may only be used once, but it's too time consuming and error prone to do the addition of the action strings by hand. I want to come up with a way to make this work but I can't tell if I'm close or completely on the wrong path.
sed -e "s/FIELD_INFO(\(.*\),\(.*?),\)/FIELD_INFO(\1, ${mapper[\1], \2}/" < ${filename}
[...]
I now know that sed expands the variables before running in the sub-shell, so the mapper[] array is not seeing the substring as an index.
Good job identifying the problem. Also, the non-greedy quantifier .*? does not work with sed and ${mapper[\1], \2} should probably be ${mapper[\1]}, \2.
If you want to keep your current approach I see two options.
Do the replacement line by line in bash, either by creating a giant sed command string that lists the action for each line, or by executing sed inside a loop for each line while creating the command strings on the fly.
Instead of the array mapper, create a file that lists the actions to be inserted, in the same order as the fields appear in the input file. Then use GNU sed's R filename command, which inserts the next line from filename. You can use this to insert the correct action each time you come across a field. However, the linebreak is inserted too, so you have to fiddle with the hold space and so on to remove these linebreaks afterwards.
Both options are not that great. Therefore I'd switch to awk to insert the actions:
sed -En 's/^\s*FIELD_INFO\(([^,]*).*/\1/p' "$filename" > fields
converter -f fields | cut -d\" -f2 > actions
awk '/^\s*FIELD_INFO\(/ {getline a < "actions"; sub(",", ", " a ",")} 1' "$filename"
With GNU grep you can simplify the first line to
grep -Po '^\s*FIELD_INFO\(\K[^,]*' "$filename" > fields
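To try the awk approach without the real converter, here is a sketch in which a shell function stands in for it. The function body is purely hypothetical: it only mimics the output format shown in the question. The regexes use [[:space:]] and [ \t] rather than \s so they do not depend on GNU extensions.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > input_file <<'EOF'
Blah Blah Unimportant
FIELD_INFO(field_name_1, output_1),
FIELD_INFO(field_name_2, output_2),
Yadda Yadda
EOF

# Hypothetical stand-in for the real converter: emits one line per field name read from "$2".
converter() {
    local n=0 name
    while IFS= read -r name; do
        n=$((n + 1))
        printf 'converter_%s "action%d" /* Use this action for %s */\n' "$name" "$n" "$name"
    done < "$2"
}

# Step 1: extract the field names. Step 2: keep only the quoted action from each converter line.
sed -En 's/^[[:space:]]*FIELD_INFO\(([^,]*).*/\1/p' input_file > fields
converter -f fields | cut -d'"' -f2 > actions

# Step 3: on each FIELD_INFO line, read the next action and splice it in after the first comma.
result=$(awk '/^[ \t]*FIELD_INFO\(/ { getline a < "actions"; sub(",", ", " a ",") } 1' input_file)
printf '%s\n' "$result"

cd / && rm -rf "$workdir"
```

This relies on the actions file having exactly one line per FIELD_INFO line, in the same order, which is the same assumption the answer makes.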
Why not try,
sed -n -e 's/^[ ]*FIELD_INFO(\(.*\),.*,/\1/p' -- input_file > script_field_names.txt
printf '/^[ ]*FIELD_INFO(%s,/ s/(\\(.[^,]*\\), \\(.[^)]*\\))/(\\1, %s, \\2)/\n' \
$(converter -f script_field_names.txt | cut -d'"' -f 2 |
paste -- script_field_names.txt -) |
sed -f /dev/stdin -- input_file
where
paste emits the map of fields (from file) and actions (from stdin)
printf emits a script read by sed from stdin
each script line becomes: /^[ ]*FIELD_INFO(fieldnameN,/ s/(\(.[^,]*\), \(.[^)]*\))/(\1, actionN, \2)/

Strange behaviour with Bash, Arrays and empty spaces

Problem:
While writing a bash script, I'm trying to import a list of products from a CSV file into an array:
#!/bin/bash
PRODUCTS=(`csvprintf -f "/home/test/data/input.csv" -x | grep "col2" | sed 's/<col2>//g' | sed 's/<\/col2>//g' | sed -n '1!p' | sed '$ d' | sed 's/ //g'`)
echo ${PRODUCTS[@]}
In the interactive shell, the output looks perfect, as follows:
burger
special fries
juice - 300ml
When I run exactly the same commands in a bash script (even when debugging with bash -x script.sh), echo ${PRODUCTS[@]} prints all the file names located in /home/test/data/ and:
burger
special
fries
juice
-
300ml
The array picks up a directory listing AND the newlines are messed up. This doesn't happen in the interactive shell (single command line).
Does anyone know how to fix that?
Looking at the docs for csvprintf, you're converting the csv into XML and then parsing it with regular expressions. This is generally a very bad idea.
You might want to install csvkit then you can do
csvcut -c prod input.csv | sed 1d
Or you could use a language that comes with a CSV parser module. For example, ruby
ruby -rcsv -e 'CSV.read("input.csv", :headers=>true).each {|row| puts row["prod"]}'
Whichever method you use, read the results into a bash array with this construct
mapfile -t products < <(command to extract the product data)
Then, to print the array elements:
for prod in "${products[@]}"; do echo "$prod"; done
# or
printf "%s\n" "${products[@]}"
The quotes around the array expansion are critical. If missing, you'll see one word per line.
Tip: don't use ALLCAPS variable names in the shell: leave those for the shell. One day you'll write PATH=something and then wonder why your script is broken.
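The effect of the quotes can be seen in a short sketch. The product names come from the question; printf stands in for whatever command extracts the data, and mapfile assumes bash 4 or later.

```shell
#!/usr/bin/env bash
# printf is a placeholder for the real extraction command.
mapfile -t products < <(printf '%s\n' 'burger' 'special fries' 'juice - 300ml')

# Quoted expansion: each element stays a single word.
quoted=("${products[@]}")

# Unquoted expansion: the shell splits on whitespace, and would also expand
# glob characters (* ? [ ]) against the current directory.
set -f   # globbing disabled here so that only the word splitting is shown
unquoted=(${products[@]})
set +f

echo "quoted:   ${#quoted[@]} elements"
echo "unquoted: ${#unquoted[@]} elements"
```

The quoted copy keeps three elements, while the unquoted one breaks "special fries" and "juice - 300ml" apart into six.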

Bash edit last character in file at specified line

I use a .txt file as a database like this:
is-program-installed= 0
is-program2-installed= 1
is-script3-runnig= 0
is-var5-declared= 1
But what if I uninstall program 2 and want to set its database value to "0"?
One way to do using sed:
sed -e '/is-program2-/ s/.$/0/' -i file.txt
It works like this:
The s/.$/0/ replaces the last character with 0: the dot matches any character, and the $ matches the end of the line--hence .$ is the last character on the line.
The /is-program2-/ is a filter, so that the replacement is only executed for matching lines.
The filter pattern I used is a bit lazy: it's short but inaccurate. A longer, more strict solution would be:
sed -e '/^is-program2-installed= / s/.$/0/' -i file.txt
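Before editing in place, it can help to preview the result. The sketch below runs the same address and substitution without -i, so sed only prints the edited text; the temporary file is a throwaway copy of the sample data from the question.

```shell
#!/usr/bin/env bash
# Throwaway copy of the sample database.
db=$(mktemp)
cat > "$db" <<'EOF'
is-program-installed= 0
is-program2-installed= 1
is-script3-runnig= 0
is-var5-declared= 1
EOF

# Without -i, sed writes the edited text to stdout and leaves the file untouched.
updated=$(sed -e '/^is-program2-installed= / s/.$/0/' "$db")
printf '%s\n' "$updated"
rm -f "$db"
```

Only the second line changes, because the address restricts the substitution to lines that start with is-program2-installed= .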
You can wrap a sed command (see @janos' answer) in a function for ease of use:
Define:
function markuninstalled () {
PROGRAM=${1?Usage: markuninstalled PROGRAM [FILE]}
FILE=${2:-file.txt}
sed -e "/^is-$PROGRAM-/ s/.$/0/" -i.bak $FILE
}
and then, use it like this:
markuninstalled program2
and it will modify the default file file.txt, keeping a backup of the original as file.txt.bak.

How do I let sed 'w' command know where the filename ends?

Every example I was able to find demonstrating the w command of sed has it in the end of the script. What if I can't do that?
An example will probably demonstrate the problem better:
$ echo '123' | sed 'w tempfile; s/[0-9]/\./g'
sed: couldn't open file tempfile; s/[0-9]/\./g: No such file or directory
(How) can I change the above so that sed knows where the filename ends?
P.S. I'm aware that I can do
$ echo '123' | sed 'w tempfile
> s/[0-9]/\./g'
...
Are there prettier options?
P.P.S. People tend to suggest to split it in two scripts. The question is then: is it safe? What if I was going to branch somewhere after the w command, and so on. Can someone confirm that any script can be split in two after any command and that will not affect the results?
Final edit: I checked that multiple -e work just as concatenated commands. I thought it was more complex (like the first one should always exit before the second one starts, etc.). However, I tried splitting a {..} block of commands between two scripts and it still worked, so the w thing is really not a serious problem. Thanks to all.
You can give a two line script to sed in one shell line:
echo '123' | sed -e 'w tempfile' -e 's/[0-9]/\./g'
This might work for you (if you're using BASH and probably GNU sed):
echo '123' | sed 'w tempfile'$'\n'';s/[0-9]/\./g'
Explanation:
The r, R and w commands need a newline to terminate the file name.
The answer to the question is "newline":
sed will treat a non-escaped literal newline as the end of the file name.
If your shell is bash, or supports the $'\n' syntax, you can solve the OP's original question this way:
echo '123' | sed 'w tempfile'$'\n''s/[0-9]/\./g'
In a more limited sh you can say
$ echo '123' | sed 'w tempfile'\
> 's/[0-9]/\./g'
What I did here was write \ as an escape, then hit enter and wrote the rest of the command there. Note that here I am escaping the newline from bash but it is being passed to sed.
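The newline rule can be checked with a short script. It is a sketch that assumes bash (for the $'\n' quoting) and uses a mktemp path in place of the literal tempfile name; the variable names are illustrative.

```shell
#!/usr/bin/env bash
tmp=$(mktemp)

# Variant 1: separate -e chunks are joined with newlines, which ends the w filename.
out1=$(echo '123' | sed -e "w $tmp" -e 's/[0-9]/./g')
saved1=$(cat "$tmp")

# Variant 2: an explicit newline via bash's $'\n' quoting inside a single script.
out2=$(echo '123' | sed "w $tmp"$'\n''s/[0-9]/./g')
saved2=$(cat "$tmp")

echo "stdout:   $out1 / $out2"
echo "tempfile: $saved1 / $saved2"
rm -f "$tmp"
```

In both variants the file receives the original line, because w runs before the substitution, while stdout shows the replaced text.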
Reverse the 2 sed command sequences like this:
echo '123' | sed 's/[0-9]/\./g;w tempfile'
i.e. perform replacements first and then write pattern space into a file.
EDIT: There was some misunderstanding whether OP wants replaced text in final file or not. My above command puts replaced text in tempfile. Since this is not what OP wanted here is one more version that avoids it:
echo '123' | sed -e 'h;s/[0-9]/\./g;g;w tempfile'

Resources