The utility 'sas2ircu' can output multiple lines for every hard drive attached to the host. A sample of the output for a single drive looks like this:
Enclosure # : 5
Slot # : 20
SAS Address : 5003048-0-185f-b21c
State : Ready (RDY)
I have a bash script that executes the sas2ircu command and does the following with the output:
identifies a drive by the RDY string
reads the numerical value of the enclosure (i.e., 5) into an array 'enc'
reads the numerical value of the slot (i.e., 20) into another array 'slot'
The code I have serves its purpose, but I'm trying to figure out if I can combine it into a single line and run the sas2ircu command once instead of twice.
mapfile -t enc < <(/root/sas2ircu 0 display|grep -B3 RDY|awk '/Enclosure/{print $NF}')
mapfile -t slot < <(/root/sas2ircu 0 display|grep -B2 RDY|awk '/Slot/{print $NF}')
I've done a bunch of reading on awk but I'm still quite novice with it and haven't come up with anything better than what I have. Suggestions?
You should be able to eliminate the grep calls and combine the two awk scripts into a single one; the general idea is to capture the enclosure and slot values and, when we see a State line containing RDY, print them to stdout:
awk '/Enclosure/{enclosure=$NF}/Slot/{slot=$NF}/State.*(RDY)/{print enclosure,slot}'
I don't have sas2ircu so I'll simulate some data (based on OP's sample):
$ cat raw.dat
Enclosure # : 5
Slot # : 20
SAS Address : 5003048-0-185f-b21c
State : Ready (RDY)
Enclosure # : 7
Slot # : 12
SAS Address : 5003048-0-185f-b21c
State : Ready (RDY)
Enclosure # : 9
Slot # : 23
SAS Address : 5003048-0-185f-b21c
State : Off (OFF)
Simulating the sas2ircu call:
$ cat raw.dat | awk '/Enclosure/{enclosure=$NF}/Slot/{slot=$NF}/State.*(RDY)/{print enclosure,slot}'
5 20
7 12
The harder part is going to be reading these into two separate arrays, and I'm not aware of an easy way to do this with a single command (e.g., mapfile doesn't provide a way to split its input across two arrays).
One idea using a bash/while loop:
unset enc slot
while read -r e s
do
enc+=( "${e}" )
slot+=( "${s}" )
done < <(cat raw.dat | awk '/Enclosure/{enclosure=$NF}/Slot/{slot=$NF}/State.*(RDY)/{print enclosure,slot}')
This generates:
$ typeset -p enc slot
declare -a enc=([0]="5" [1]="7")
declare -a slot=([0]="20" [1]="12")
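Putting it together against the real command (a sketch I can't test without the hardware; the /root/sas2ircu path and arguments are taken from the question):
unset enc slot
while read -r e s
do
enc+=( "${e}" )
slot+=( "${s}" )
done < <(/root/sas2ircu 0 display | awk '/Enclosure/{enclosure=$NF}/Slot/{slot=$NF}/State.*(RDY)/{print enclosure,slot}')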
Related
I have a dynamically generated, indexed array in bash and want to know whether it is sparse or dense.
An array is sparse iff there are unset indices before the last entry. Otherwise the array is dense.
The check should work in every case, even for empty arrays, very big arrays (exceeding ARG_MAX when expanded), and of course arrays with arbitrary entries (for instance null entries or entries containing *, \, spaces, and linebreaks). The latter should be fairly easy, as you probably don't want to expand the values of the array anyways.
Ideally, the check should be efficient and portable.
Here are some basic test cases to check your solution.
Your check can use the hard-coded global variable name a for compatibility with older bash versions. For bash 4.3 and higher you may want to use local -n isDense_array="$1" instead, so that you can specify the array to be checked.
isDense() {
# INSERT YOUR CHECK HERE
# if array `a` is dense, return 0 (success)
# if array `a` is sparse, return any of 1-255 (failure)
}
test() {
isDense && result=dense || result=sparse
[[ "$result" = "$expected" ]] ||
echo "Test in line $BASH_LINENO failed: $expected array considered $result"
}
expected=dense
a=(); test
a=(''); test
a=(x x x); test
expected=sparse
a=([1]=x); test
a=([1]=); test
a=([0]=x [2]=x); test
a=([4]=x [5]=x [6]=x); test
a=([0]=x [3]=x [4]=x [13]=x); test
To benchmark your check, you can use
a=($(seq 9999999))
time {
isDense
unset 'a[0]'; isDense
a[0]=1; unset 'a[9999998]'; isDense
a=([0]=x [9999999999]=x); isDense
}
Approach
Non-empty, dense arrays have indices from 0 to ${#a[*]}-1. Due to the pigeonhole principle, the last index of a sparse array must be greater than or equal to ${#a[*]}.
Bash Script
To get the last index, we assume that the list of indices ${!a[*]} is in ascending order. Bash's manual does not specify any order, but (at least for bash 5 and below) the implementation guarantees this order (in the source code file array.c, search for array_keys_to_word_list).
isDense() {
[[ "${#a[*]}" = 0 || " ${!a[*]}" == *" $((${#a[*]}-1))" ]]
}
For small arrays this works very well. For huge arrays the check is a bit slow because of the ${!a[*]}. The benchmark from the question took 9.8 seconds.
Loadable Bash Builtin
The approach in this answer only needs the last index. But bash only allows extracting all indices using ${!a[*]}, which is unnecessarily slow. Internally, bash knows what the last index is. So if you wanted, you could write a loadable builtin that has access to bash's internal data structures.
Of course, this is not really a practical solution. If performance mattered that much, you shouldn't be using a bash script in the first place. Nevertheless, I wrote such a builtin for the fun of it.
The space and time complexity of such a builtin is independent of the size and structure of the array. Checking isdense a should be as fast as something like b=1.
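For completeness, loading and invoking such a builtin would look roughly like this (a sketch; the shared object name isdense.so is an assumption, while enable -f itself is standard bash):
enable -f ./isdense.so isdense    # load the compiled builtin (assumed file name)
isdense a && echo dense || echo sparse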
UPDATE: (re)ran tests in an Ubuntu 20 VM (much better times than previous tests in a cygwin/bash env; times are closer to those reported by Socowi)
NOTE: I populated my array with 10 million entries ({0..9999999}).
Using the same assumption as Socowi ...
To get the last index of an array, we assume that the list of indices ${!a[*]} is in ascending order. Bash's manual does not specify any order, but (at least for bash 5 and below) the implementation guarantees this order
... we can make the same assumption about the output from typeset -p a, namely, output is ordered by index.
Finding the last index:
$ a[9999999]='[abcdef]'
$ typeset -p a | tail -c 40
9999998]="9999998" [9999999]="[abcdef]")
1st attempt using awk to strip off the last index:
$ typeset -p a | awk '{delim="[][]"; split($NF,arr,delim); print arr[2]}'
9999999
This is surprisingly slow (> 3.5 minutes for cygwin/bash) while running at 100% (single) cpu utilization and eating up ~4 GB of memory.
2nd attempt using tail -c to limit the data fed to awk:
$ typeset -p a | tail -c 40 | awk '{delim="[][]"; split($NF,arr,delim); print arr[2]}'
9999999
This came in at a (relatively) blazing speed of ~1.6 seconds (ubuntu/bash).
3rd attempt saving the last array value in a local variable before parsing with awk:
NOTES:
as Socowi pointed out (in comments), the last element's value could contain characters (spaces, brackets, single/double quotes, etc.) that make parsing quite complicated
one workaround is to save the last array value in a variable, parse the typeset -p output, then put the value back
accessing the last array value can be accomplished via array[-1] (requires bash 4.3+)
This also comes in at a (relatively) blazing speed of ~1.6 seconds (ubuntu/bash):
$ lastv="${a[-1]}"
$ a[-1]=""
$ typeset -p a | tail -c 40 | awk -F'[][]' '{print $(NF-1)}'
9999999
$ a[-1]="${lastv}"
Function:
This gives us the following isDense() function:
initial function
unset -f isDense
isDense() {
local last=$(typeset -p a | tail -c 40 | awk '{delim="[][]"; split($NF,arr,delim); print arr[2]}')
[[ "${#a[#]}" = 0 || "${#a[#]}" -ne "${last}" ]]
}
latest function (variable=last array value)
unset -f isDense
isDense() {
local lastv="${a[-1]}"
a[-1]=""
local last=$(typeset -p a | tail -c 40 | awk -F'[][]' '{print $(NF-1)}')
[[ "${#a[#]}" = 0 || "${#a[#]}" -ne "${last}" ]]
rc=$?
a[-1]="${lastv}"
return "${rc}"
}
Benchmark:
from the question (minus the 4th test; I got tired of waiting for my 10-million-entry array to repopulate), which means ...
keep in mind the benchmark/test is calling isDense() 3 times
ran each function a few times in ubuntu/bash and these were the best times ...
Socowi's function:
real 0m11.717s
user 0m9.486s
sys 0m1.982s
oguz ismail's function:
real 0m10.450s
user 0m9.899s
sys 0m0.546s
My initial typeset|tail-c|awk function:
real 0m4.514s
user 0m3.574s
sys 0m1.442s
Latest test (variable=last array value) with Socowi's declare|tr|tail-n function:
real 0m5.306s
user 0m4.130s
sys 0m2.670s
Latest test (variable=last array value) with original typeset|tail-c|awk function:
real 0m4.305s
user 0m3.247s
sys 0m1.761s
isDense() {
local -rn array="$1"
((${#array[@]})) || return 0
local -ai indices=("${!array[@]}")
((indices[-1] + 1 == ${#array[@]}))
}
To follow the self-contained call convention:
test() {
local result
isDense "$1" && result=dense || result=sparse
[[ "$result" = "$2" ]] ||
echo "Test in line $BASH_LINENO failed: $2 array considered $result"
}
a=(); test a dense
a=(''); test a dense
a=(x x x); test a dense
a=([1]=x); test a sparse
a=([1]=); test a sparse
a=([0]=x [2]=x); test a sparse
a=([4]=x [5]=x [6]=x); test a sparse
a=([0]=x [3]=x [4]=x [13]=x); test a sparse
The file /tmp/file.csv contains the following:
name,age,gender
bob,21,m
jane,32,f
The CSV file will always have headers, but might contain a different number of fields:
id,title,url,description
1,foo name,foo.io,a cool foo site
2,bar title,http://bar.io,a great bar site
3,baz heading,https://baz.io,some description
In either case, I want to convert my CSV data into an array of associative arrays.
What I need
So, I want a Bash 4.3 function that takes CSV as piped input and sends the array to stdout:
/tmp/file.csv:
name,age,gender
bob,21,m
jane,32,f
It needs to be used in my templating system, like this:
{{foo | csv_to_array | foo2}}
^ this is a fixed API, I must use that syntax; foo2 must receive the array as standard input.
The csv_to_array func must do its thing, so that afterwards I can do this:
$ declare -p row1; declare -p row2; declare -p new_array;
and it would give me this:
declare -A row1=([gender]="m" [name]="bob" [age]="21" )
declare -A row2=([gender]="f" [name]="jane" [age]="32" )
declare -a new_array=([0]="row1" [1]="row2")
Once I have this array structure (an indexed array of associative array names), I have a shell-based templating system to access them, like so:
{{#new_array}}
Hi {{item.name}}, you are {{item.age}} years old.
{{/new_array}}
But I'm struggling to generate the arrays I need.
Things I tried:
I have already tried using this as a starting point to get the array structure I need:
while IFS=',' read -r -a my_array; do
echo ${my_array[0]} ${my_array[1]} ${my_array[2]}
done <<< $(cat /tmp/file.csv)
(from Shell: CSV to array)
..and also this:
cat /tmp/file.csv | while read line; do
line=( ${line//,/ } )
echo "0: ${line[0]}, 1: ${line[1]}, all: ${line[#]}"
done
(from https://www.reddit.com/r/commandline/comments/1kym4i/bash_create_array_from_one_line_in_csv/cbu9o2o/)
but I didn't really make any progress in getting what I want out the other end...
EDIT:
Accepted the 2nd answer, but I had to hack the library I am using to make either solution work.
I'll be happy to look at other answers, which do not export the declare commands as strings, to be run in the current env, but instead somehow hoist the resultant arrays of the declare commands to the current env (the current env is wherever the function is run from).
Example:
$ cat file.csv | csv_to_array
$ declare -p row2 # gives the data
So, to be clear, if the above ^ works in a terminal, it'll work in the library I'm using without the hacks I had to add (which involved grepping STDIN for ^declare -a and using source <(cat); eval $STDIN... in other functions)...
See my comments on the 2nd answer for more info.
The approach is straightforward:
Read the column headers into an array
Read the file line by line, in each line …
Create a new associative array and register its name in the array of array names
Read the fields and assign them according to the column headers
In the last step we cannot use read -a, mapfile, or things like these since they only create regular arrays with numbers as indices, but we want an associative array instead, so we have to create the array manually.
However, the implementation is a bit convoluted because of bash's quirks.
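The core trick used below is printf -v writing into an array element whose name is built at run time; a minimal standalone illustration (the names row1, name, key, and value are just for this example):
declare -Ag row1                      # create the associative array by name
name=row1 key=age value=21
printf -v "$name[$key]" %s "$value"   # effectively row1[age]=21
declare -p row1                       # declare -A row1=([age]="21" )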
The following function parses stdin and creates arrays accordingly.
I took the liberty of renaming your array new_array to rowNames.
#! /bin/bash
csvToArrays() {
IFS=, read -ra header
rowIndex=0
while IFS= read -r line; do
((rowIndex++))
rowName="row$rowIndex"
declare -Ag "$rowName"
IFS=, read -ra fields <<< "$line"
fieldIndex=0
for field in "${fields[@]}"; do
printf -v quotedFieldHeader %q "${header[fieldIndex++]}"
printf -v "$rowName[$quotedFieldHeader]" %s "$field"
done
rowNames+=("$rowName")
done
declare -p "${rowNames[@]}" rowNames
}
Calling the function in a pipe has no effect. Bash executes the commands in a pipe in a subshell, therefore you won't have access to the arrays created by someCommand | csvToArrays. Instead, call the function as either one of the following
csvToArrays < <(someCommand) # when input comes from a command, except "cat file"
csvToArrays < someFile # when input comes from a file
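A quick way to inspect the result outside the templating system is to source the emitted declare statements back into the current shell (a sketch using the question's /tmp/file.csv; the key order in the output may differ):
source <(csvToArrays < /tmp/file.csv)
declare -p rowNames row1 row2
# declare -a rowNames=([0]="row1" [1]="row2")
# declare -A row1=([name]="bob" [age]="21" [gender]="m" )
# declare -A row2=([name]="jane" [age]="32" [gender]="f" )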
Bash scripts like these tend to be very slow. That's the reason why I didn't bother to extract printf -v quotedFieldHeader … from the inner loop even though it will do the same work over and over again.
I think the whole templating thing and everything related would be way easier to program and faster to execute in languages like python, perl, or something like that.
The following script:
csv_to_array() {
local -a values
local -a headers
local counter
IFS=, read -r -a headers
declare -a new_array=()
counter=1
while IFS=, read -r -a values; do
new_array+=( row$counter )
declare -A "row$counter=($(
paste -d '' <(
printf "[%s]=\n" "${headers[#]}"
) <(
printf "%q\n" "${values[#]}"
)
))"
(( counter++ ))
done
declare -p new_array ${!row*}
}
foo2() {
source <(cat)
declare -p new_array ${!row*} |
sed 's/^/foo2: /'
}
echo "==> TEST 1 <=="
cat <<EOF |
id,title,url,description
1,foo name,foo.io,a cool foo site
2,bar title,http://bar.io,a great bar site
3,baz heading,https://baz.io,some description
EOF
csv_to_array |
foo2
echo "==> TEST 2 <=="
cat <<EOF |
name,age,gender
bob,21,m
jane,32,f
EOF
csv_to_array |
foo2
will output:
==> TEST 1 <==
foo2: declare -a new_array=([0]="row1" [1]="row2" [2]="row3")
foo2: declare -A row1=([url]="foo.io" [description]="a cool foo site" [id]="1" [title]="foo name" )
foo2: declare -A row2=([url]="http://bar.io" [description]="a great bar site" [id]="2" [title]="bar title" )
foo2: declare -A row3=([url]="https://baz.io" [description]="some description" [id]="3" [title]="baz heading" )
==> TEST 2 <==
foo2: declare -a new_array=([0]="row1" [1]="row2")
foo2: declare -A row1=([gender]="m" [name]="bob" [age]="21" )
foo2: declare -A row2=([gender]="f" [name]="jane" [age]="32" )
The output comes from the foo2 function.
The csv_to_array function first reads the headers. Then, for each line read, it appends a new element to the new_array array and creates a new associative array named row$counter, whose elements are built by pairing the header names with the values read from the line. At the end, the function prints the output of declare -p.
The foo2 function sources its standard input, so the arrays come into scope for it. It then outputs those values again, prefixing each line with foo2:.
This question already has answers here:
A variable modified inside a while loop is not remembered
I have a pretty simple sh script where I make a system cat call, collect the results and parse some relevant information before storing the information in an array, which seems to work just fine. But as soon as I exit the for loop where I store the information, the array seems to clear itself. I'm wondering if I am accessing the array incorrectly outside of the for loop. Relevant portion of my script:
#!/bin/sh
declare -a QSPI_ARRAY=()
cat /proc/mtd | while read mtd_instance
do
# split result into individual words
words=($mtd_instance)
for word in "${words[@]}"
do
# check for uboot
if [[ $word == *"uboot"* ]]
then
mtd_num=${words[0]}
index=${mtd_num//[!0-9]/} # strip everything except the integers
QSPI_ARRAY[$index]="uboot"
echo "QSPI_ARRAY[] at index $index: ${QSPI_ARRAY[$index]}"
elif [[ $word == *"fpga_a"* ]]
then
echo "found it: "$word""
mtd_num=${words[0]}
index=${mtd_num//[!0-9]/} # strip everything except the integers
QSPI_ARRAY[$index]="fpga_a"
echo "QSPI_ARRAY[] at index $index: ${QSPI_ARRAY[$index]}"
# other items are added to the array, all successfully
fi
done
echo "length of array: ${#QSPI_ARRAY[#]}"
echo "----------------------"
done
My output is great until I exit the for loop. While within the for loop, the array size increments and I can check that the item has been added. After the for loop is complete I check the array like so:
echo "RESULTING ARRAY:"
echo "length of array: ${#QSPI_ARRAY[#]}"
for qspi in "${QSPI_ARRAY}"
do
echo "qspi instance: $qspi"
done
Here are my results, echoed to my display:
dev: size erasesize name
length of array: 0
-------------
mtd0: 00100000 00001000 "qspi-fsbl-uboot"
QSPI_ARRAY[] at index 0: uboot
length of array: 1
-------------
mtd1: 00500000 00001000 "qspi-fpga_a"
QSPI_ARRAY[] at index 1: fpga_a
length of array: 2
-------------
RESULTING ARRAY:
length of array: 0
qspi instance:
EDIT: After some debugging, it seems I have two different arrays here somehow. I initialized the array like so: QSPI_ARRAY=("a" "b" "c" "d" "e" "f" "g"), and after my for-loop for parsing the array it is still a, b, c, etc. How do I have two different arrays of the same name here?
This structure:
cat /proc/mtd | while read mtd_instance
do
...
done
Means that whatever happens between do and done cannot have any effect on the shell environment that persists after the done.
The fact that the while loop is on the right-hand side of a pipe (|) means that it runs in a subshell. Once the loop exits, so does the subshell, taking all of its variable assignments with it.
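A two-line demonstration of the effect, independent of /proc/mtd:
count=0
printf 'a\nb\n' | while read -r line; do count=$(( count + 1 )); done
echo "$count"   # still prints 0: the loop body ran in a subshell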
If you want a while loop which makes changes that stick around, don't use a pipe. Input redirection doesn't create a subshell, and in this case, you can just read from the file directly:
while read mtd_instance
do
...
done </proc/mtd
If you had a more complicated command than a cat, you might need to use process substitution. Still using cat as an example, that looks like this:
while read mtd_instance
do
...
done < <(cat /proc/mtd)
In the specific case of your example code, I think you could simplify it somewhat, perhaps like this:
#!/usr/bin/env bash
QSPI_ARRAY=()
while read -a words; do
declare -i mtd_num=${words[0]//[!0-9]/}
for word in "${words[#]}"; do
for type in uboot fpga_a; do
if [[ $word == *$type* ]]; then
QSPI_ARRAY[mtd_num]=$type
break 2
fi
done
done
done </proc/mtd
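With the /proc/mtd lines shown in the question, the resulting array should end up looking something like this:
$ declare -p QSPI_ARRAY
declare -a QSPI_ARRAY=([0]="uboot" [1]="fpga_a")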
Is this potentially what you are seeing:
http://mywiki.wooledge.org/BashFAQ/024
Here is the complete code. In BER_SB, the values of K and SB passed to the rand-src command and the value of sigma passed to the transmit command are calculated in main. The values written to the BER array by BER_SB are used later in main.
BER_SB()
{
s=$1
mkdir "$1"
cp ex-ldpc36-5000a.pchk ex-ldpc36-5000a.gen "$1"
cd "$1"
rand-src ex-ldpc36-5000a.src $s "$K"x"$SB"
encode ex-ldpc36-5000a.pchk ex-ldpc36-5000a.gen ex-ldpc36-5000a.src ex-ldpc36-5000a.enc
transmit ex-ldpc36-5000a.enc ex-ldpc36-5000a.rec 1 awgn $sigma
decode ex-ldpc36-5000a.pchk ex-ldpc36-5000a.rec ex-ldpc36-5000a.dec awgn $sigma prprp 250
BER="$(verify ex-ldpc36-5000a.pchk ex-ldpc36-5000a.dec ex-ldpc36-5000a.gen ex-ldpc36-5000a.src)"
echo $BER
}
export BER
export -f BER_SB
K=5000 # No of Message Bits
N=10000 # No of encoded bits
R=$(echo "scale=3; $K/$N" | bc) # Code Rate
# Creation of Parity Check and Generator files
make-ldpc ex-ldpc36-5000a.pchk $K $N 2 evenboth 3 no4cycle
make-gen ex-ldpc36-5000a.pchk ex-ldpc36-5000a.gen dense
# Creation of file to write BER values
echo>/media/BER/BER_LDPC36_5000_E.txt -n
S=1; # Variable to control no of blocks of source messages
for Eb_No in 0.5 1.0
do
B=$(echo "10^($S+1)" | bc)
# No of Blocks are increased for higher Eb/No values
S=$(($S+1))
# As we have four cores in our PC, we divide the number of source blocks into four subblocks to process them in parallel
SB=$(echo "$B/4" | bc)
# Calculation of Noise Variance from Eb/No values
tmp=$(echo "scale=3; e(($Eb_No/10)*l(10))" | bc -l)
sigma=$(echo "scale=3; sqrt(1/(2*$R*$tmp))" | bc)
# Call the function to process each subblock
parallel BER_SB ::: 1 2 3 4
BER_T= Here I want to process values of BER variables returned by BER_SB function
done
It is not very clear what you want done. From what you write it seems you want the same 3 lines run 4 times in parallel. That is easily done:
runone() {
mkdir "$1"
cd "$1"
rand-src ex-ldpc36-5000a.src 0 5000 1000
encode ex-ldpc36-5000a.pchk ex-ldpc36-5000a.gen ex-ldpc36-5000a.src ex-ldpc36-5000a.enc
transmit ex-ldpc36-5000a.enc ex-ldpc36-5000a.rec 1 awgn .80
}
export -f runone
parallel runone ::: 1 2 3 4
But that does not use the '1 2 3 4' for anything. If you want the '1 2 3 4' used for anything you will need to describe better what you really want.
Edit:
It is unclear whether you have:
Read the examples: LESS=+/EXAMPLE: man parallel
Walked through the tutorial: man parallel_tutorial
Watched the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
and whether I can assume that the material covered in those is known to you.
In your code you use BER[1]..BER[4], but they are not initialized. You also use BER[x] in the function. Maybe you forgot that a sub-shell cannot pass values in an array back to its parent?
If I were you I would move all the computation in the function and call the function with all needed parameters instead of passing them as environment variables. Something like:
parallel BER_SB ::: 1 2 3 4 ::: 0.5 1.0 ::: $S > computed.out
post process computed.out >>/media/BER/BER_LDPC36_5000_E.txt
To keep the arguments in computed.out you can use --tag. That may make it easier to postprocess.
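For example, building on the call above (a sketch; it assumes BER_SB has been extended to take the extra parameters, and that the numeric BER is the last field of its output, which may need adjusting to verify's real format):
parallel --tag BER_SB ::: 1 2 3 4 ::: 0.5 1.0 ::: $S > computed.out
# --tag prefixes each output line with its arguments (subblock, Eb/No, S),
# so the post-processing step can group by Eb/No and average:
awk '{ sum[$2] += $NF; cnt[$2]++ } END { for (e in sum) print e, sum[e]/cnt[e] }' computed.out >> /media/BER/BER_LDPC36_5000_E.txt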
I have a large text file in the following format:
1 2327544589
1 3554547564
1 2323444333
2 3235434544
2 3534532222
2 4645644333
3 3424324322
3 5323243333
...
The output should be text files whose names carry a suffix taken from the first column of the original file, with each output file keeping the numbers from the second column of the corresponding rows, as follows:
file1.txt:
2327544589
3554547564
2323444333
file2.txt:
3235434544
3534532222
4645644333
file3.txt:
3424324322
5323243333
...
The script should run on Solaris, but I'm also having trouble with the awk command and with options of other commands, such as -c with cut; they are very limited, so I am looking for commands commonly available on Solaris. I am not allowed to change or install anything on the system. Using a loop is not very efficient because the script takes too long with large files. So, aside from awk and loops, any suggestions?
Something like this perhaps:
$ awk 'NF>1{print $2 > "file"$1".txt"}' input
$ cat file1.txt
2327544589
3554547564
2323444333
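One Solaris caveat: the default /usr/bin/awk there is very old, so nawk (or /usr/xpg4/bin/awk) is usually the safer choice, and parenthesizing the redirection target avoids parsing quirks in some awks (an untested sketch of the same idea):
$ nawk 'NF>1{ print $2 > ("file" $1 ".txt") }' input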
or if you have bash available, try this:
#!/bin/bash
while read a b
do
[ -z "$a" ] && continue
echo "$b" >> "file$a.txt"
done < input
output:
$ paste file{1..3}.txt
2327544589 3235434544 3424324322
3554547564 3534532222 5323243333
2323444333 4645644333