Related
I need to remove an element from an array in bash shell.
Generally I'd simply do:
array=("${(#)array:#<element to remove>}")
Unfortunately the element I want to remove is a variable so I can't use the previous command.
Down here an example:
array+=(pluto)
array+=(pippo)
delete=(pluto)
array( ${array[#]/$delete} ) -> but clearly doesn't work because of {}
Any idea?
The following works as you would like in bash and zsh:
$ array=(pluto pippo)
$ delete=pluto
$ echo ${array[#]/$delete}
pippo
$ array=( "${array[#]/$delete}" ) #Quotes when working with strings
If need to delete more than one element:
...
$ delete=(pluto pippo)
for del in ${delete[#]}
do
array=("${array[#]/$del}") #Quotes when working with strings
done
Caveat
This technique actually removes prefixes matching $delete from the elements, not necessarily whole elements.
Update
To really remove an exact item, you need to walk through the array, comparing the target to each element, and using unset to delete an exact match.
array=(pluto pippo bob)
delete=(pippo)
for target in "${delete[#]}"; do
for i in "${!array[#]}"; do
if [[ ${array[i]} = $target ]]; then
unset 'array[i]'
fi
done
done
Note that if you do this, and one or more elements is removed, the indices will no longer be a continuous sequence of integers.
$ declare -p array
declare -a array=([0]="pluto" [2]="bob")
The simple fact is, arrays were not designed for use as mutable data structures. They are primarily used for storing lists of items in a single variable without needing to waste a character as a delimiter (e.g., to store a list of strings which can contain whitespace).
If gaps are a problem, then you need to rebuild the array to fill the gaps:
for i in "${!array[#]}"; do
new_array+=( "${array[i]}" )
done
array=("${new_array[#]}")
unset new_array
You could build up a new array without the undesired element, then assign it back to the old array. This works in bash:
array=(pluto pippo)
new_array=()
for value in "${array[#]}"
do
[[ $value != pluto ]] && new_array+=($value)
done
array=("${new_array[#]}")
unset new_array
This yields:
echo "${array[#]}"
pippo
This is the most direct way to unset a value if you know it's position.
$ array=(one two three)
$ echo ${#array[#]}
3
$ unset 'array[1]'
$ echo ${array[#]}
one three
$ echo ${#array[#]}
2
This answer is specific to the case of deleting multiple values from large arrays, where performance is important.
The most voted solutions are (1) pattern substitution on an array, or (2) iterating over the array elements. The first is fast, but can only deal with elements that have distinct prefix, the second has O(n*k), n=array size, k=elements to remove. Associative array are relative new feature, and might not have been common when the question was originally posted.
For the exact match case, with large n and k, possible to improve performance from O(nk) to O(n+klog(k)). In practice, O(n) assuming k much lower than n. Most of the speed up is based on using associative array to identify items to be removed.
Performance (n-array size, k-values to delete). Performance measure seconds of user time
N K New(seconds) Current(seconds) Speedup
1000 10 0.005 0.033 6X
10000 10 0.070 0.348 5X
10000 20 0.070 0.656 9X
10000 1 0.043 0.050 -7%
As expected, the current solution is linear to N*K, and the fast solution is practically linear to K, with much lower constant. The fast solution is slightly slower vs the current solution when k=1, due to additional setup.
The 'Fast' solution: array=list of input, delete=list of values to remove.
declare -A delk
for del in "${delete[#]}" ; do delk[$del]=1 ; done
# Tag items to remove, based on
for k in "${!array[#]}" ; do
[ "${delk[${array[$k]}]-}" ] && unset 'array[k]'
done
# Compaction
array=("${array[#]}")
Benchmarked against current solution, from the most-voted answer.
for target in "${delete[#]}"; do
for i in "${!array[#]}"; do
if [[ ${array[i]} = $target ]]; then
unset 'array[i]'
fi
done
done
array=("${array[#]}")
Here's a one-line solution with mapfile:
$ mapfile -d $'\0' -t arr < <(printf '%s\0' "${arr[#]}" | grep -Pzv "<regexp>")
Example:
$ arr=("Adam" "Bob" "Claire"$'\n'"Smith" "David" "Eve" "Fred")
$ echo "Size: ${#arr[*]} Contents: ${arr[*]}"
Size: 6 Contents: Adam Bob Claire
Smith David Eve Fred
$ mapfile -d $'\0' -t arr < <(printf '%s\0' "${arr[#]}" | grep -Pzv "^Claire\nSmith$")
$ echo "Size: ${#arr[*]} Contents: ${arr[*]}"
Size: 5 Contents: Adam Bob David Eve Fred
This method allows for great flexibility by modifying/exchanging the grep command and doesn't leave any empty strings in the array.
Partial answer only
To delete the first item in the array
unset 'array[0]'
To delete the last item in the array
unset 'array[-1]'
To expand on the above answers, the following can be used to remove multiple elements from an array, without partial matching:
ARRAY=(one two onetwo three four threefour "one six")
TO_REMOVE=(one four)
TEMP_ARRAY=()
for pkg in "${ARRAY[#]}"; do
for remove in "${TO_REMOVE[#]}"; do
KEEP=true
if [[ ${pkg} == ${remove} ]]; then
KEEP=false
break
fi
done
if ${KEEP}; then
TEMP_ARRAY+=(${pkg})
fi
done
ARRAY=("${TEMP_ARRAY[#]}")
unset TEMP_ARRAY
This will result in an array containing:
(two onetwo three threefour "one six")
Here's a (probably very bash-specific) little function involving bash variable indirection and unset; it's a general solution that does not involve text substitution or discarding empty elements and has no problems with quoting/whitespace etc.
delete_ary_elmt() {
local word=$1 # the element to search for & delete
local aryref="$2[#]" # a necessary step since '${!$2[#]}' is a syntax error
local arycopy=("${!aryref}") # create a copy of the input array
local status=1
for (( i = ${#arycopy[#]} - 1; i >= 0; i-- )); do # iterate over indices backwards
elmt=${arycopy[$i]}
[[ $elmt == $word ]] && unset "$2[$i]" && status=0 # unset matching elmts in orig. ary
done
return $status # return 0 if something was deleted; 1 if not
}
array=(a 0 0 b 0 0 0 c 0 d e 0 0 0)
delete_ary_elmt 0 array
for e in "${array[#]}"; do
echo "$e"
done
# prints "a" "b" "c" "d" in lines
Use it like delete_ary_elmt ELEMENT ARRAYNAME without any $ sigil. Switch the == $word for == $word* for prefix matches; use ${elmt,,} == ${word,,} for case-insensitive matches; etc., whatever bash [[ supports.
It works by determining the indices of the input array and iterating over them backwards (so deleting elements doesn't screw up iteration order). To get the indices you need to access the input array by name, which can be done via bash variable indirection x=1; varname=x; echo ${!varname} # prints "1".
You can't access arrays by name like aryname=a; echo "${$aryname[#]}, this gives you an error. You can't do aryname=a; echo "${!aryname[#]}", this gives you the indices of the variable aryname (although it is not an array). What DOES work is aryref="a[#]"; echo "${!aryref}", which will print the elements of the array a, preserving shell-word quoting and whitespace exactly like echo "${a[#]}". But this only works for printing the elements of an array, not for printing its length or indices (aryref="!a[#]" or aryref="#a[#]" or "${!!aryref}" or "${#!aryref}", they all fail).
So I copy the original array by its name via bash indirection and get the indices from the copy. To iterate over the indices in reverse I use a C-style for loop. I could also do it by accessing the indices via ${!arycopy[#]} and reversing them with tac, which is a cat that turns around the input line order.
A function solution without variable indirection would probably have to involve eval, which may or may not be safe to use in that situation (I can't tell).
Using unset
To remove an element at particular index, we can use unset and then do copy to another array. Only just unset is not required in this case. Because unset does not remove the element it just sets null string to the particular index in array.
declare -a arr=('aa' 'bb' 'cc' 'dd' 'ee')
unset 'arr[1]'
declare -a arr2=()
i=0
for element in "${arr[#]}"
do
arr2[$i]=$element
((++i))
done
echo "${arr[#]}"
echo "1st val is ${arr[1]}, 2nd val is ${arr[2]}"
echo "${arr2[#]}"
echo "1st val is ${arr2[1]}, 2nd val is ${arr2[2]}"
Output is
aa cc dd ee
1st val is , 2nd val is cc
aa cc dd ee
1st val is cc, 2nd val is dd
Using :<idx>
We can remove some set of elements using :<idx> also. For example if we want to remove 1st element we can use :1 as mentioned below.
declare -a arr=('aa' 'bb' 'cc' 'dd' 'ee')
arr2=("${arr[#]:1}")
echo "${arr2[#]}"
echo "1st val is ${arr2[1]}, 2nd val is ${arr2[2]}"
Output is
bb cc dd ee
1st val is cc, 2nd val is dd
http://wiki.bash-hackers.org/syntax/pe#substring_removal
${PARAMETER#PATTERN} # remove from beginning
${PARAMETER##PATTERN} # remove from the beginning, greedy match
${PARAMETER%PATTERN} # remove from the end
${PARAMETER%%PATTERN} # remove from the end, greedy match
In order to do a full remove element, you have to do an unset command with an if statement. If you don't care about removing prefixes from other variables or about supporting whitespace in the array, then you can just drop the quotes and forget about for loops.
See example below for a few different ways to clean up an array.
options=("foo" "bar" "foo" "foobar" "foo bar" "bars" "bar")
# remove bar from the start of each element
options=("${options[#]/#"bar"}")
# options=("foo" "" "foo" "foobar" "foo bar" "s" "")
# remove the complete string "foo" in a for loop
count=${#options[#]}
for ((i = 0; i < count; i++)); do
if [ "${options[i]}" = "foo" ] ; then
unset 'options[i]'
fi
done
# options=( "" "foobar" "foo bar" "s" "")
# remove empty options
# note the count variable can't be recalculated easily on a sparse array
for ((i = 0; i < count; i++)); do
# echo "Element $i: '${options[i]}'"
if [ -z "${options[i]}" ] ; then
unset 'options[i]'
fi
done
# options=("foobar" "foo bar" "s")
# list them with select
echo "Choose an option:"
PS3='Option? '
select i in "${options[#]}" Quit
do
case $i in
Quit) break ;;
*) echo "You selected \"$i\"" ;;
esac
done
Output
Choose an option:
1) foobar
2) foo bar
3) s
4) Quit
Option?
Hope that helps.
There is also this syntax, e.g. if you want to delete the 2nd element :
array=("${array[#]:0:1}" "${array[#]:2}")
which is in fact the concatenation of 2 tabs. The first from the index 0 to the index 1 (exclusive) and the 2nd from the index 2 to the end.
POSIX shell script does not have arrays.
So most probably you are using a specific dialect such as bash, korn shells or zsh.
Therefore, your question as of now cannot be answered.
Maybe this works for you:
unset array[$delete]
What I do is:
array="$(echo $array | tr ' ' '\n' | sed "/itemtodelete/d")"
BAM, that item is removed.
This is a quick-and-dirty solution that will work in simple cases but will break if (a) there are regex special characters in $delete, or (b) there are any spaces at all in any items. Starting with:
array+=(pluto)
array+=(pippo)
delete=(pluto)
Delete all entries exactly matching $delete:
array=(`echo $array | fmt -1 | grep -v "^${delete}$" | fmt -999999`)
resulting in
echo $array -> pippo, and making sure it's an array:
echo $array[1] -> pippo
fmt is a little obscure: fmt -1 wraps at the first column (to put each item on its own line. That's where the problem arises with items in spaces.) fmt -999999 unwraps it back to one line, putting back the spaces between items. There are other ways to do that, such as xargs.
Addendum: If you want to delete just the first match, use sed, as described here:
array=(`echo $array | fmt -1 | sed "0,/^${delete}$/{//d;}" | fmt -999999`)
Actually, I just noticed that the shell syntax somewhat has a behavior built-in that allows for easy reconstruction of the array when, as posed in the question, an item should be removed.
# let's set up an array of items to consume:
x=()
for (( i=0; i<10; i++ )); do
x+=("$i")
done
# here, we consume that array:
while (( ${#x[#]} )); do
i=$(( $RANDOM % ${#x[#]} ))
echo "${x[i]} / ${x[#]}"
x=("${x[#]:0:i}" "${x[#]:i+1}")
done
Notice how we constructed the array using bash's x+=() syntax?
You could actually add more than one item with that, the content of a whole other array at once.
In ZSH this is dead easy (note this uses more bash compatible syntax than necessary where possible for ease of understanding):
# I always include an edge case to make sure each element
# is not being word split.
start=(one two three 'four 4' five)
work=(${(#)start})
idx=2
val=${work[idx]}
# How to remove a single element easily.
# Also works for associative arrays (at least in zsh)
work[$idx]=()
echo "Array size went down by one: "
[[ $#work -eq $(($#start - 1)) ]] && echo "OK"
echo "Array item "$val" is now gone: "
[[ -z ${work[(r)$val]} ]] && echo OK
echo "Array contents are as expected: "
wanted=("${start[#]:0:1}" "${start[#]:2}")
[[ "${(j.:.)wanted[#]}" == "${(j.:.)work[#]}" ]] && echo "OK"
echo "-- array contents: start --"
print -l -r -- "-- $#start elements" ${(#)start}
echo "-- array contents: work --"
print -l -r -- "-- $#work elements" "${work[#]}"
Results:
Array size went down by one:
OK
Array item two is now gone:
OK
Array contents are as expected:
OK
-- array contents: start --
-- 5 elements
one
two
three
four 4
five
-- array contents: work --
-- 4 elements
one
three
four 4
five
To avoid conflicts with array index using unset - see https://stackoverflow.com/a/49626928/3223785 and https://stackoverflow.com/a/47798640/3223785 for more information - reassign the array to itself: ARRAY_VAR=(${ARRAY_VAR[#]}).
#!/bin/bash
ARRAY_VAR=(0 1 2 3 4 5 6 7 8 9)
unset ARRAY_VAR[5]
unset ARRAY_VAR[4]
ARRAY_VAR=(${ARRAY_VAR[#]})
echo ${ARRAY_VAR[#]}
A_LENGTH=${#ARRAY_VAR[*]}
for (( i=0; i<=$(( $A_LENGTH -1 )); i++ )) ; do
echo ""
echo "INDEX - $i"
echo "VALUE - ${ARRAY_VAR[$i]}"
done
exit 0
[Ref.: https://tecadmin.net/working-with-array-bash-script/ ]
How about something like:
array=(one two three)
array_t=" ${array[#]} "
delete=one
array=(${array_t// $delete / })
unset array_t
#/bin/bash
echo "# define array with six elements"
arr=(zero one two three 'four 4' five)
echo "# unset by index: 0"
unset -v 'arr[0]'
for i in ${!arr[*]}; do echo "arr[$i]=${arr[$i]}"; done
arr_delete_by_content() { # value to delete
for i in ${!arr[*]}; do
[ "${arr[$i]}" = "$1" ] && unset -v 'arr[$i]'
done
}
echo "# unset in global variable where value: three"
arr_delete_by_content three
for i in ${!arr[*]}; do echo "arr[$i]=${arr[$i]}"; done
echo "# rearrange indices"
arr=( "${arr[#]}" )
for i in ${!arr[*]}; do echo "arr[$i]=${arr[$i]}"; done
delete_value() { # value arrayelements..., returns array decl.
local e val=$1; new=(); shift
for e in "${#}"; do [ "$val" != "$e" ] && new+=("$e"); done
declare -p new|sed 's,^[^=]*=,,'
}
echo "# new array without value: two"
declare -a arr="$(delete_value two "${arr[#]}")"
for i in ${!arr[*]}; do echo "arr[$i]=${arr[$i]}"; done
delete_values() { # arraydecl values..., returns array decl. (keeps indices)
declare -a arr="$1"; local i v; shift
for v in "${#}"; do
for i in ${!arr[*]}; do
[ "$v" = "${arr[$i]}" ] && unset -v 'arr[$i]'
done
done
declare -p arr|sed 's,^[^=]*=,,'
}
echo "# new array without values: one five (keep indices)"
declare -a arr="$(delete_values "$(declare -p arr|sed 's,^[^=]*=,,')" one five)"
for i in ${!arr[*]}; do echo "arr[$i]=${arr[$i]}"; done
# new array without multiple values and rearranged indices is left to the reader
This question already has answers here:
A variable modified inside a while loop is not remembered
(8 answers)
Closed 6 years ago.
I have a pretty simple sh script where I make a system cat call, collect the results and parse some relevant information before storing the information in an array, which seems to work just fine. But as soon as I exit the for loop where I store the information, the array seems to clear itself. I'm wondering if I am accessing the array incorrectly outside of the for loop. Relevant portion of my script:
#!/bin/sh
declare -a QSPI_ARRAY=()
cat /proc/mtd | while read mtd_instance
do
# split result into individiual words
words=($mtd_instance)
for word in "${words[#]}"
do
# check for uboot
if [[ $word == *"uboot"* ]]
then
mtd_num=${words[0]}
index=${mtd_num//[!0-9]/} # strip everything except the integers
QSPI_ARRAY[$index]="uboot"
echo "QSPI_ARRAY[] at index $index: ${QSPI_ARRAY[$index]}"
elif [[ $word == *"fpga_a"* ]]
then
echo "found it: "$word""
mtd_num=${words[0]}
index=${mtd_num//[!0-9]/} # strip everything except the integers
QSPI_ARRAY[$index]="fpga_a"
echo "QSPI_ARRAY[] at index $index: ${QSPI_ARRAY[$index]}"
# other items are added to the array, all successfully
fi
done
echo "length of array: ${#QSPI_ARRAY[#]}"
echo "----------------------"
done
My output is great until I exit the for loop. While within the for loop, the array size increments and I can check that the item has been added. After the for loop is complete I check the array like so:
echo "RESULTING ARRAY:"
echo "length of array: ${#QSPI_ARRAY[#]}"
for qspi in "${QSPI_ARRAY}"
do
echo "qspi instance: $qspi"
done
Here are my results, echod to my display:
dev: size erasesize name
length of array: 0
-------------
mtd0: 00100000 00001000 "qspi-fsbl-uboot"
QSPI_ARRAY[] at index 0: uboot
length of array: 1
-------------
mtd1: 00500000 00001000 "qspi-fpga_a"
QSPI_ARRAY[] at index 1: fpga_a
length of array: 2
-------------
RESULTING ARRAY:
length of array: 0
qspi instance:
EDIT: After some debugging, it seems I have two different arrays here somehow. I initialized the array like so: QSPI_ARRAY=("a" "b" "c" "d" "e" "f" "g"), and after my for-loop for parsing the array it is still a, b, c, etc. How do I have two different arrays of the same name here?
This structure:
cat /proc/mtd | while read mtd_instance
do
...
done
Means that whatever comes between do and done cannot have any effects inside the shell environment that are still there after the done.
The fact that the while loop is on the right hand side of a pipe (|) means that it runs in a subshell. Once the loop exits, so does the subshell. And all of its variable settings.
If you want a while loop which makes changes that stick around, don't use a pipe. Input redirection doesn't create a subshell, and in this case, you can just read from the file directly:
while read mtd_instance
do
...
done </proc/mtd
If you had a more complicated command than a cat, you might need to use process substitution. Still using cat as an example, that looks like this:
while read mtd_instance
do
...
done < <(cat /proc/mtd)
In the specific case of your example code, I think you could simplify it somewhat, perhaps like this:
#!/usr/bin/env bash
QSPI_ARRAY=()
while read -a words; do␣
declare -i mtd_num=${words[0]//[!0-9]/}
for word in "${words[#]}"; do
for type in uboot fpga_a; do
if [[ $word == *$type* ]]; then
QSPI_ARRAY[mtd_num]=$type
break 2
fi
done
done
done </proc/mtd
Is this potentially what you are seeing:
http://mywiki.wooledge.org/BashFAQ/024
I was doing an exercise on reading from a setup file in which every line specifies two words and a number. The number denotes the number of words in between the two words specified. Another file – input.txt – has a block of text, and the program attempts to count the number of occurrences in the input file which follows the constraints in each line in the setup file (i.e., two particular words a and b should be separated by n words, where a, b and n are specified in the setup file.
So I've tried to do this as a shell script, but my implementation is probably highly inefficient. I used an array to store the words from the setup file, and then did a linear search on the text file to find out the words, and the works. Here's a bit of the code, if it helps:
#!/bin/sh
j=0
count=0;
m=0;
flag=0;
error=0;
while read line; do
line=($line);
a[j]=${line[0]}
b[j]=${line[1]}
num=${line[2]}
c[j]=`expr $num + 0`
j=`expr $j + 1`
done <input2.txt
while read line2; do
line2=($line2)
for (( i=0; $i<=50; i++ )); do
for (( m=0; $m<j; m++)); do
g=`expr $i + ${c[m]}`
g=`expr $g + 1`
if [ "${line2[i]}" == "${a[m]}" ] ; then
for (( k=$i; $k<$g; k++)); do
if [[ "${line2[k]}" == *.* ]]; then
flag=1
break
fi
done
if [ "${b[m]}" == "${line2[g]}" ] ; then
if [ "$flag" == 1 ] ; then
error=`expr $error + 1`
fi
count=`expr $count + 1`
fi
flag=0
fi
if [ "${line2[i]}" == "${b[m]}" ] ; then
for (( k=$i; $k<$g; k++)); do
if [[ "${line2[k]}" == *.* ]]; then
flag=1
break
fi
done
if [ "${a[m]}" == "${line2[g]}" ] ; then
if [ "$flag" == 1 ] ; then
error=`expr $error + 1`
fi
count=`expr $count + 1`
fi
flag=0
fi
done
done
done <input.txt
count=`expr $count - $error`
echo "| Count = $count |"
As you can see, this takes a lot of time.
I was thinking of a more efficient way to implement this, in C or C++, this time. What could be a possible alternative implementation of this, efficiency considered? I thought of hash tables, but could there be a better way?
I'd like to hear what everyone has to say on this.
Here's a fully working possibility. It is not 100% pure bash since it uses (GNU) sed: I'm using sed to lowercase everything and to get rid of punctuation marks. Maybe you won't need this. Adapt to your needs.
#!/bin/bash
input=input.txt
setup=setup.txt
# The Check function
Check() {
# $1 is word1
# $2 is word2
# $3 is number of words between word1 and word2
nb=0
# Get all positions of w1
IFS=, read -a q <<< "${positions[$1]}"
# Check, for each position, if word2 is at distance $3 from word1
for i in "${q[#]}"; do
[[ ${words[$i+$3+1]} = $2 ]] && ((++nb))
done
echo "$nb"
}
# Slurp input file in an array
words=( $(sed 's/[,.:!?]//g;s/\(.*\)/\L\1/' -- "$input") )
# For each word, specify its positions in file
declare -A positions
pos=0
for i in "${words[#]}"; do
positions[$i]+=$((pos++)),
done
# Do it!
while read w1 w2 p; do
# Check that w1 w2 are not empty
[[ -n $w2 ]] || continue
# Check that p is a number
[[ $p =~ ^[[:digit:]]+$ ]] || continue
n=$(Check "$w1" "$w2" "$p")
[[ $w1 != $w2 ]] && (( n += $(Check "$w2" "$w1" "$p") ))
echo "$w1 $w2 $p: $n"
done < <(sed 's/\(.*\)/\L\1/' -- "$setup")
How does it work:
we first read the whole file input.txt in the array words: a word per field. Observe I'm using sed here to delete all punctuation marks (well, only ,, ., :, !, ?, for testing purposes, add some more if you wish) and to lowercase every letter.
Loop through the array words and for each word, put its position in an associative array positions:
w => "position1,position2,...,positionk,"
Finally, we read the setup.txt file (filtered through sed again to lowercase everything – optional see below). Do a quick check whether the line is valid (2 words and a number) and then call the Check function (twice, for each permutation of the given words, unless both words are equal).
The Check function finds all positions of word1 in file, thanks to associative array positions and then using the array words, check whether word2 is at the given "distance" from word1.
The second sed is optional. I've filtered the setup.txt file through sed to lowercase everything. This sed will leave only very little overhead, so, efficiency-wise, it's not a big deal. You'll be able to add more filtering later to make sure the data is consistent with how the script will use it (e.g., get rid of punctuation marks). Otherwise you could:
Get rid of it altogether: replace the corresponding line (the last line) with just
done < "$setup"
In this case, you'll have to trust the guy/gal who will write the setup.txt file.
Get rid of it as above, but still want to convert everything to lowercase. In this case, below the
while read w1 w2 p; do
line, just add these lines:
w1=${w1,,}
w2=${w2,,}
That's the bash way to lowercase a string.
Caveats. The script will break if:
The number given in setup.txt file starts with a 0 and contains an 8 or a 9. This is because bash will consider it's an octal number, where 8's and 9's are not valid. There are workarounds for this.
The text in input.txt doesn't follow proper typographical practices: a punctuation mark is always followed by a space. E.g., if the input file contains
The quick,brown,dog jumps over the lazy fox
then after the sed treatment the text will look like
The quickbrowndog jumps over the lazy fox
and the words quick, brown and dog won't be treated properly. You can replace the sed substitution s/[,:!?]//g with s/[,:!?]/ /g to convert these symbols with a space. It's up to you, but in that case, abbreviations as, e.g., e.g. and i.e. might not be considered properly… it now really depends what you need to do.
Different character encodings are used… I don't really know how robust you need the script to be, and what languages and encodings you'll consider.
(Add stuff here :).)
About efficiency. I'd say the algorithm is rather efficient. bash is probably not the best suited language for that, but it's a lot of fun, and not that difficult after all if we look at it (less than 20 lines of relevant code, and even less than that!). If you only have 50 files with 50000 words, it's ok, you will not notice too much difference between bash and perl/python/awk/C/you-name-it: bash performs decently quickly for files of this type. Now if you have 100000 files each containing millions of words, well, a different approach should be taken and a different language should be used (but I don't know which one).
If:
it can get complex for the sake of efficiency
the text file can be large
the setup file can have many rows
then I would do it the following way:
As preparation I would create:
A hash map with the index of the word as key and the word as the value (named -say- WORDS). So WORDS[1] would be the first word, WORDS[2] the second, and so on.
A hashmap with the words as keys and the list of indexes as values (named -say- INDEXES). So if WORDS[2] and WORDS[5] is "dog" and none other, than INDEXES["dog"] would yield the numers 2 and 5. The value can be a dynamic indexed array or a linked list. Linked list is better if there are words that occur many times.
You can read the text file, and populate both structures at the same time.
Processing:
For each row of the setup file I would get the indexes in INDEXES[firstword] and check if WORDS[index + wordsinbetween + 1] equals with secondword. If it does, that's a hit.
Notes:
Preparation: You only read the text file once. For each word in the text file, you're doing fast operations thats' performance is not really effected by the amount of words already processed.
Processing: You only read the setup file once. For each row you're here too doing operations that are only effected by the number of occurences of firstword in the text file.
I am trying to write code to break up a large array into many different small arrays. Eventually the array I would be passed is one of unknown size, this is just my test subject. I have gotten this far:
#!/bin/bash
num=(10 3 12 3 4 4)
inArray=${#num[#]}
numArrays=$(($inArray/2))
remain=$(($inArray%2))
echo $numArrays
echo $remain
nun=0
if test $remain -gt $nun; then
numArrays=$(($numArrays+1))
fi
array=(1 2)
j=0
for ((i=0;i<$numArrays;i++, j=j+2)); do
array=("${num[#]:$j:2}")
echo "The array says: ${array[#]}"
echo "The size? ${#array[#]}"
done
What I am really having a problem with is : I would like to make the variable 'array' be able to change names slightly every time, so each array is kept and has a unique name after the loop. I have tried making the name array_$i but that returns:
[Stephanie#~]$ ./tmp.sh
3
0
./tmp.sh: line 16: syntax error near unexpected token `"${num[#]:$j:2}"'
./tmp.sh: line 16: ` array_$i=("${num[#]:$j:2}")'
[Stephanie#RDT00069 ~]$ ./tmp.sh
3
0
./tmp.sh: line 16: syntax error near unexpected token `$i'
./tmp.sh: line 16: ` array($i)=("${num[#]:$j:2}")'
Does anyone have any advice?
Thanks
I don't think you can really avoid eval here, but you might be able to do it safely if you're careful. Here's my approach:
for name in "${!array_*}"; do # Get all names starting with array_
i="${name#array_*}" # Get the part after array_
if [[ $i != *[^0-9]* ]]; then # Check that it's a number.
printf '%s is not a valid subarray name\n' "$name"
else
# Create a variable named "statement" that contains code you want to eval.
printf -v statement 'cur_array=( "${%s[#]}" )' "$name"
eval "$statement"
# Do interesting things with $cur_array
fi
done
Before this, when you're just creating the array, you know what $name should be, so just use the printf -v part.
To make it even safer, you could save all the allowed array names in another array and check that $name is a member.
With simple variables, you can use the declare keyword to make indirect assignments:
v=foo
declare $v=5
echo $foo # Prints 5
This doesn't extend to arrays in the obvious (to me, anyway) sense:
i=2
# This produces a syntax error
declare -a array_$i=("${num[#]:$j:2}")
Instead, you can declare an empty array
declare -a array_$i
or assign items one at a time:
declare -a array_$i[0]=item1 array_$i[1]=item2
Here's an example of using a for-loop to copy, say, the 3rd
and 4th letters of a big array into a smaller one. We use
i as the dynamic part of the name of the smaller array, and
j as the index into that array.
letters=(a b c d e f)
i=1
j=0
for letter in "${letters[#]:2:2}"; do
# E.g., i=0 and j=1 would result in
# declare -a array_0[1]=c
declare -a array_$i[$j]=$letter
let j+=1
done
done
echo ${array_1[#]}; # c d
${foo[#]:x:y} gives us elements x, x+1, ..., x+y-1 from foo, and
You can wrap the whole thing inside another for-loop to accomplish the goal of splitting letters into 3 smaller arrays:
# We'll create array_0, array_1, and array_2
for i in 0 1 2; do
# Just like our subset above, but start at position i*2 instead of
# a constant.
for letter in "${letters[#]:$((i*2)):2}"; do
declare -a array_$i[$j]=$letter
done
done
Once you manage to populate your three arrays, how do you access them without eval? Bash has syntax for indirect access:
v=foo
foo=5
echo ${!v} # echoes 5!
The exclamation point says to use the word that follows as a variable whose value should be used as the name of the parameter to expand. Knowing that, you might think you could do the following, but you'd be wrong.
i=1
v=array_$i # array_1
echo ${!v[0]} # array_1[0] is c, so prints c, right? Wrong.
In the above, bash tries to find a variable called v[0] and expand it to get the name of a parameter to expand. We actually have to treat our array plus its index as a single name:
i=1
v=array_$i[0]
echo ${!v} # This does print c
This should work, but this is not a good solution, another language may be better bash does not support multi dimensional arrays
eval array_$i='('"${num[#]:$j:2}"')'
And then, for example
eval 'echo "${array_'$i'[0]}"'
I have a customized .profile that I use in ksh and below is a function that I created to skip back and forth from directories with overly complicated or long names.
As you can see, the pathnames are stored in an array (BOOKMARKS[]) to keep track of them and reference them at a later time. I want to be able to delete certain values from the array, using a case statement (or OPTARG if necessary) so that I can just type bmk -d # to remove the path at the associated index.
I have fiddled around with array +A and -A, but it just wound up screwing up my array (what is left in the commented out code may not be pretty...I didn't proofread it).
Any suggestions/tips on how to create that functionality? Thanks!
# To bookmark the current directory you are in for easy navigation back and forth from multiple non-aliased directories
# Use like 'bmk' (sets the current directory to a bookmark number) to go back to this directory, i.e. type 'bmk 3' (for the 3rd)
# To find out what directories are linked to which numbers, type 'bmk -l' (lowercase L)
# For every new directory bookmarked, the number will increase so the first time you run 'bmk' it will be 1 then 2,3,4...etc. for every consecutive run therea
fter
# TODO: finish -d (delete bookmark entry) function
make_bookmark()
{
if [[ $# -eq 0 ]]; then
BOOKMARKS[${COUNTER}]=${PWD}
(( COUNTER=COUNTER+1 ))
else
case $1 in
-l) NUM_OF_ELEMENTS=${#BOOKMARKS[*]}
while [[ ${COUNTER} -lt ${NUM_OF_ELEMENTS} ]]
do
(( ACTUAL_NUM=i+1 ))
echo ${ACTUAL_NUM}":"${BOOKMARKS[${i}]}
(( COUNTER=COUNTER+1 ))
done
break ;;
#-d) ACTUAL_NUM=$2
#(( REMOVE=${ACTUAL_NUM}-1 ))
#echo "Removing path ${BOOKMARKS[${REMOVE}]} from 'bmk'..."
#NUM_OF_ELEMENTS=${#BOOKMARKS[*]}
#while [[ ${NUM_OF_ELEMENTS} -gt 0 ]]
#do
#if [[ ${NUM_OF_ELEMENTS} -ne ${ACTUAL_NUM} ]]; then
# TEMP_ARR=$(echo "${BOOKMARKS[*]}")
# (( NUM_OF_ELEMENTS=${NUM_OF_ELEMENTS}-1 ))
#fi
#echo $TEMP_ARR
#done
#break
#for VALUE in ${TEMP_ARR}
#do
# set +A BOOKMARK ${TEMP_ARR}
#done
#echo ${BOOKMARK[*]}
#break ;;
*) (( INDEX=$1-1 ))
cd ${BOOKMARKS[${INDEX}]}
break ;;
esac
fi
}
Arrays in the Korn shell (and Bash and others) are sparse, so if you use unset to delete members of the array, you won't be able to use the size of the array as an index to the last member and other limitations.
Here are some useful snippets (the second for loop is something you might be able to put to use right away):
array=(1 2 3)
unset array[2]
echo ${array[2]} # null
indices=(${!array[#]}) # create an array of the indices of "array"
size=${#indices[#]} # the size of "array" is the number of indices into it
size=${#array[#]} # same
echo ${array[#]: -1} # you can use slices to get array elements, -1 is the last one, etc.
for element in ${array[#]}; do # iterate over the array without an index
for index in ${indices[#]} # iterate over the array WITH an index
do
echo "Index: ${index}, Element: ${array[index]}"
done
for index in ${!array[#]} # iterate over the array WITH an index, directly
That last one can eliminate the need for a counter.
Here are a couple more handy techniques:
array+=("new element") # append a new element without referring to an index
((counter++)) # shorter than ((counter=counter+1)) or ((counter+=1))
if [[ $var == 3 ]] # you can use the more "natural" comparison operators inside double square brackets
while [[ $var < 11 ]] # another example
echo ${array[${index}-1] # math inside an array subscript
This all assumes ksh93, some things may not work in earlier versions.
you can use unset. eg to delete array element 1
unset array[0]
to delete entire array
unset array
A few caveats regarding the previous answer:
First: I see this error all the time. When you provide an array element to "unset", you have to quote it. Consider:
$ echo foo > ./a2
$ ls a[2]
a2
$ a2="Do not delete this"
$ a=(this is not an array)
$ unset -v a[2]
$ echo "a2=${a2-UNSET}, a[]=${a[#]}"
a2=UNSET a[]=this is not an array
What happened? Globbing. You obviously wanted to delete element 2 of a[], but shell syntax being what it is, the shell first checked the current directory for a file that matched the glob pattern "a[2]". If it finds a match, it replaces the glob pattern with that filename, and you wind up making a decision about which variable to delete based on what files exist in your current directory.
This is profoundly stupid. But it's not something anyone has bothered to fix, apparently, and the error turns up in all kinds of documentation and example code from the last 3 decades.
Next is a related problem: it's easy to insert elements in your associative array with any key you like. But it's harder to remove these elements:
typeset -A assoc
key="foo] bar"
assoc[$key]=3 #No problem!
unset -v "assoc[$key]" #Problem!
In bash you can do this:
unset -v "assoc[\$key]"
In Korn Shell, you have to do this:
unset -v "assoc[foo\]\ bar]"
So it gets a bit more complicated in the case where your keys contain syntax characters.