Random element from an array bigger than 32767 in bash

Having:
mapfile -t words < <( head -10000 /usr/share/dict/words)
echo "${#words[#]}" #10000
r=$(( $RANDOM % ${#words[#]} ))
echo "$r ${words[$r]}"
This selects a random word from the array of 10k words.
But once the array is bigger than 32767 entries (e.g. the whole file, 200k+ words), it stops working, because $RANDOM only goes up to 32767. From man bash:
Each time this parameter is referenced, a random integer between 0 and 32767 is generated.
mapfile -t words < /usr/share/dict/words
echo "${#words[#]}" # 235886
r=$(( $RANDOM % ${#words[#]} )) #how to change this?
echo "$r ${words[$r]}"
I don't want to use some Perl one-liner like perl -plE 's/.*/int(rand()*$_)/e', as not every system has Perl installed. I'm looking for the simplest possible solution, and I don't care about true randomness: it isn't for cryptography. :)

One possible solution is to do some maths with the outcome of $RANDOM:
big_random=`expr $RANDOM \* 32768 + $RANDOM`
(The multiplier must be 32768, the number of values $RANDOM can take, so that every result from 0 to 1073741823 is reachable in exactly one way.)
Another is to use $RANDOM once to pick a block of the input file, then $RANDOM again to pick a line from within that block.
Note that $RANDOM doesn't allow you to specify a range, and that % gives a non-uniform result. Further discussion at: How to generate random number in Bash?
As an aside, it doesn't seem particularly wise to read the whole of words into memory. Unless you'll be doing a lot of repeated access to this data structure, consider doing this without slurping up the whole file at once.
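For instance, combining the big-random trick with the no-slurp advice (a sketch; the % step has a tiny modulo bias, which is fine here since true randomness isn't required):
file=/usr/share/dict/words
lines=$(wc -l < "$file")
r=$(( (RANDOM * 32768 + RANDOM) % lines ))
sed -n "$(( r + 1 ))p" "$file"   # print the (r+1)-th line, no array needed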

If shuf is available on your system...
r=$(shuf -i 0-$(( ${#words[@]} - 1 )) -n 1)
(Note the upper bound: shuf -i takes an inclusive range, so it must be the array length minus one.)
If not, you could use $RANDOM several times and concatenate the results to obtain a number with enough digits to cover your needs. You should concatenate, not add, as adding random numbers will not produce an even distribution (just like throwing two dice will produce a total of 7 more often than a total of 2).
For instance:
printf -v r1 %05d $RANDOM
printf -v r2 %05d $RANDOM
printf -v r3 %05d $RANDOM
r4=${r1:1}${r2:1}${r3:1}
r=$(( 10#$r4 % ${#words[@]} ))
The printf statements are used to make sure leading zeros are kept; the -v option is a hidden gem that assigns the formatted value to a variable (which can, among other things, avoid the use of eval in many useful real-life cases). The first digit of each of r1, r2 and r3 is stripped because it can only be 0, 1, 2 or 3. The 10# prefix forces base 10, so that a leading zero in r4 doesn't make the arithmetic expansion parse it as octal.

Naive concatenation like $RANDOM$RANDOM will get you ten digits, but for each five-digit prefix, the last five digits can only be in the range 00000-32767.
The number 1234567890, for example, is not a possibility, because 67890 > 32767.
That may be fine. Personally I find this option a bit nicer. It gives you the numbers 0-1073741823 with no gaps.
big_random=$(( RANDOM * 32768 + RANDOM ))
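Applied to the original question, that drops straight in (a sketch using the words array from the question):
r=$(( (RANDOM * 32768 + RANDOM) % ${#words[@]} ))
echo "$r ${words[$r]}"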

Related

Getting first index of bash array [duplicate]

Is there a bash way to get the index of the nth element of a sparse bash array?
printf "%s\t" ${!zArray[#]} | cut -f$N
Using cut to index the indices of an array seems excessive, especially for the first or last.
If getting the index is only a step towards getting the entry then there is an easy solution: Convert the array into a dense (= non-sparse) array, then access those entries …
sparse=([1]=I [5]=V [10]=X [50]=L)
dense=("${sparse[@]}")
printf %s "${dense[2]}"
# prints X
Or as a function …
nthEntry() {
    shift "$1"
    shift
    printf %s "$1"
}
nthEntry 2 "${sparse[@]}"
# prints X
Assuming (just like you did) that the list of keys "${!sparse[@]}" expands in sorted order (I found neither guarantees nor warnings in bash's manual, therefore I opened another question), this approach can also be used to extract the nth index without external programs like cut.
indices=("${!sparse[@]}")
echo "${indices[2]}"
# prints 10 (the index of X)
nthEntry 2 "${!sparse[@]}"
# prints 10 (the index of X)
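As an aside, the same indices array also answers the "first or last" case from the question directly (negative subscripts require bash 4.3 or newer):
indices=("${!sparse[@]}")
echo "${indices[0]}"    # first index: 1
echo "${indices[-1]}"   # last index: 50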
If I understood your question correctly, you can do it like this using read:
# sparse array
declare -a arr=([10]="10" [15]="20" [21]="30" [34]="40" [47]="50")
# desired index
n=2
# read all indices into an array
read -ra iarr < <(printf "%s\t" "${!arr[@]}")
# find the nth element
echo "${arr[${iarr[n]}]}"
# prints 30


How can I assign a range using array length?

This is probably a silly question, more out of curiosity. I have an array in bash:
array=(name1.ext name2.ext name3.ext)
I want to strip off the extension from each element. I was trying to do this by looping over each element, but I was having trouble setting the range of the loop (see below):
for i in `echo {0..`expr ${#array[@]} - 1`}`; do
    newarray[$i]+=$(echo "${array[$i]:0:5}")
done
I'm not able to just use a set range (e.g. seq 0 3), because it changes based on the folder, so I wanted to be able to use the length of the array minus 1. I was able to work around this using:
for (( i=0; i<${#array[@]}; i++ )); do
    newarray[$i]+=$(echo "${array[$i]:0:5}")
done
But I thought there should be some way to do it with the "array length minus 1" method above, and I wondered how I was thinking about this incorrectly. Any pointers are appreciated.
Thanks!
Dan
You can apply various parameter expansion operators to each element of an array directly, without needing an explicit loop.
$ array=(name1.ext name2.ext name3.ext)
$ printf '%s\n' "${array[@]%.ext}"
name1
name2
name3
$ newarray=( "${array[@]%.ext}" )
In general, though, there is hardly ever any need to generate a range of numbers to iterate over. Just use the C-style for loop:
for (( i=0; i<${#array[@]}; i++ )); do
    newarray[i]="${array[i]%.ext}"
done
With Bash, you could simply loop over your array elements with ${files[#]}:
#!/bin/bash
files=(name1.ext name2.ext name3.ext)
for f in "${files[@]}"; do
    echo "${f%.*}"
done
Also, substring removal with ${f%.*} is a better choice if you have extensions of different lengths.
You can use the seq command:
for i in `seq 0 $(( ${#array[@]} - 1 ))`
do
···
done
or the bash brace expansion (but in this case you need eval):
for i in `eval echo {0..$(( ${#array[@]} - 1 ))}`
do
···
done
But there is a better way (it even works with sparse arrays): let bash give us the array indexes:
for i in "${!array[@]}"
do
···
done
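And combining the last two points, a small sketch: if the array may be sparse and you want the results to keep the original indices, loop over the index list and strip inside the loop:
array=([2]=name1.ext [7]=name2.ext [9]=name3.ext)
for i in "${!array[@]}"; do
    newarray[i]="${array[i]%.ext}"
done
echo "${newarray[7]}"   # prints name2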

Which data structure might be a more efficient implementation?

I was doing an exercise on reading from a setup file in which every line specifies two words and a number. The number denotes the number of words between the two words. Another file, input.txt, has a block of text, and the program counts the occurrences in the input file that satisfy the constraint on each line of the setup file (i.e., two particular words a and b separated by exactly n words, where a, b and n are specified in the setup file).
So I've tried to do this as a shell script, but my implementation is probably highly inefficient. I used arrays to store the words from the setup file, and then did a linear search on the text file to find the words, and so on. Here's the code, if it helps:
#!/bin/bash
j=0
count=0
m=0
flag=0
error=0
while read line; do
    line=($line)
    a[j]=${line[0]}
    b[j]=${line[1]}
    num=${line[2]}
    c[j]=`expr $num + 0`
    j=`expr $j + 1`
done <input2.txt
while read line2; do
    line2=($line2)
    for (( i=0; $i<=50; i++ )); do
        for (( m=0; $m<j; m++ )); do
            g=`expr $i + ${c[m]}`
            g=`expr $g + 1`
            if [ "${line2[i]}" == "${a[m]}" ]; then
                for (( k=$i; $k<$g; k++ )); do
                    if [[ "${line2[k]}" == *.* ]]; then
                        flag=1
                        break
                    fi
                done
                if [ "${b[m]}" == "${line2[g]}" ]; then
                    if [ "$flag" == 1 ]; then
                        error=`expr $error + 1`
                    fi
                    count=`expr $count + 1`
                fi
                flag=0
            fi
            if [ "${line2[i]}" == "${b[m]}" ]; then
                for (( k=$i; $k<$g; k++ )); do
                    if [[ "${line2[k]}" == *.* ]]; then
                        flag=1
                        break
                    fi
                done
                if [ "${a[m]}" == "${line2[g]}" ]; then
                    if [ "$flag" == 1 ]; then
                        error=`expr $error + 1`
                    fi
                    count=`expr $count + 1`
                fi
                flag=0
            fi
        done
    done
done <input.txt
count=`expr $count - $error`
echo "| Count = $count |"
As you can see, this takes a lot of time.
I was thinking of a more efficient way to implement this, this time in C or C++. What could a possible alternative implementation look like, with efficiency in mind? I thought of hash tables, but could there be a better way?
I'd like to hear what everyone has to say on this.
Here's a fully working possibility. It is not 100% pure bash, since it uses (GNU) sed: I'm using sed to lowercase everything and to get rid of punctuation marks. Maybe you won't need this; adapt it to your needs.
#!/bin/bash
input=input.txt
setup=setup.txt
# The Check function
Check() {
    # $1 is word1
    # $2 is word2
    # $3 is the number of words between word1 and word2
    nb=0
    # Get all positions of word1
    IFS=, read -a q <<< "${positions[$1]}"
    # Check, for each position, whether word2 is at distance $3 from word1
    for i in "${q[@]}"; do
        [[ ${words[$i+$3+1]} = $2 ]] && ((++nb))
    done
    echo "$nb"
}
# Slurp input file into an array
words=( $(sed 's/[,.:!?]//g;s/\(.*\)/\L\1/' -- "$input") )
# For each word, record its positions in the file
declare -A positions
pos=0
for i in "${words[@]}"; do
    positions[$i]+=$((pos++)),
done
# Do it!
while read w1 w2 p; do
    # Check that w1 and w2 are not empty
    [[ -n $w2 ]] || continue
    # Check that p is a number
    [[ $p =~ ^[[:digit:]]+$ ]] || continue
    n=$(Check "$w1" "$w2" "$p")
    [[ $w1 != $w2 ]] && (( n += $(Check "$w2" "$w1" "$p") ))
    echo "$w1 $w2 $p: $n"
done < <(sed 's/\(.*\)/\L\1/' -- "$setup")
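To try it out (hypothetical sample files; this assumes the script above is saved as count.sh):
printf '%s\n' 'The quick brown fox jumps over the lazy dog.' > input.txt
printf '%s\n' 'the fox 2' > setup.txt
bash count.sh
# prints: the fox 2: 2   (one match in each direction)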
How does it work:
We first read the whole file input.txt into the array words, one word per field. Observe that I'm using sed here to delete all punctuation marks (well, only , . : ! ? for testing purposes; add more if you wish) and to lowercase every letter.
Then we loop through the array words and, for each word, append its position to an associative array positions:
w => "position1,position2,...,positionk,"
Finally, we read the setup.txt file (filtered through sed again to lowercase everything; this is optional, see below). We do a quick check that each line is valid (two words and a number) and then call the Check function (twice, once for each order of the given words, unless both words are equal).
The Check function finds all positions of word1 in the file, thanks to the associative array positions, and then, using the array words, checks whether word2 is at the given "distance" from word1.
The second sed is optional. I've filtered the setup.txt file through sed to lowercase everything. This sed adds only very little overhead, so efficiency-wise it's not a big deal, and you'll be able to add more filtering later to make sure the data is consistent with how the script uses it (e.g., getting rid of punctuation marks). Otherwise you could:
Get rid of it altogether: replace the corresponding line (the last line) with just
done < "$setup"
In this case, you'll have to trust the guy/gal who will write the setup.txt file.
Get rid of it as above, but still convert everything to lowercase. In this case, below the
while read w1 w2 p; do
line, just add these lines:
w1=${w1,,}
w2=${w2,,}
That's the bash way to lowercase a string.
Caveats. The script will break if:
The number given in the setup.txt file starts with a 0 and contains an 8 or a 9. This is because bash will consider it an octal number, in which 8 and 9 are not valid digits. There are workarounds for this (e.g., forcing base 10 with a 10# prefix).
The text in input.txt doesn't follow proper typographical practice, i.e., that a punctuation mark is always followed by a space. E.g., if the input file contains
The quick,brown,dog jumps over the lazy fox
then after the sed treatment the text will look like
The quickbrowndog jumps over the lazy fox
and the words quick, brown and dog won't be treated properly. You can replace the sed substitution s/[,.:!?]//g with s/[,.:!?]/ /g to turn these symbols into spaces instead. It's up to you, but in that case abbreviations such as e.g. and i.e. might not be handled properly… it really depends on what you need to do.
Different character encodings are used… I don't really know how robust you need the script to be, or which languages and encodings you'll have to consider.
(Add stuff here :).)
About efficiency. I'd say the algorithm is rather efficient. bash is probably not the best-suited language for this, but it's a lot of fun, and not that difficult after all if you look at it (fewer than 20 lines of relevant code). If you only have 50 files with 50000 words each, it's ok: you will not notice much difference between bash and perl/python/awk/C/you-name-it, since bash performs decently quickly on files of this size. If you have 100000 files each containing millions of words, however, a different approach should be taken and a different language used.
If:
it can get complex for the sake of efficiency
the text file can be large
the setup file can have many rows
then I would do it the following way:
As preparation I would create:
A hash map with the index of the word as the key and the word as the value (named, say, WORDS). So WORDS[1] would be the first word, WORDS[2] the second, and so on.
A hash map with the words as keys and the list of indexes as values (named, say, INDEXES). So if WORDS[2] and WORDS[5] are "dog" and no others are, then INDEXES["dog"] would yield the numbers 2 and 5. The value can be a dynamic indexed array or a linked list; a linked list is better if there are words that occur many times.
You can read the text file, and populate both structures at the same time.
Processing:
For each row of the setup file, I would get the indexes in INDEXES[firstword] and check whether WORDS[index + wordsinbetween + 1] equals secondword. If it does, that's a hit. (See the bash sketch after the notes below.)
Notes:
Preparation: you only read the text file once. For each word in the text file, you're doing fast operations whose performance is not really affected by the number of words already processed.
Processing: you only read the setup file once. For each row, you're likewise doing operations whose cost depends only on the number of occurrences of firstword in the text file.
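In bash 4 terms, the two structures could look like this (a sketch of the design just described; a C or C++ implementation would use a vector plus a hash map, e.g. std::vector and std::unordered_map):
#!/bin/bash
declare -a WORDS      # position -> word (0-based here)
declare -A INDEXES    # word -> space-separated list of positions
i=0
for w in $(tr -d '.,:!?' < input.txt); do
    WORDS[i]=$w
    INDEXES[$w]+="$i "
    i=$((i + 1))
done
count=0
while read -r w1 w2 n; do
    [ -n "$w2" ] || continue        # skip malformed lines
    for idx in ${INDEXES[$w1]}; do
        # a hit: w2 sits exactly n words after w1
        [[ ${WORDS[idx + n + 1]} = "$w2" ]] && count=$((count + 1))
    done
done < setup.txt
echo "Count = $count"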

Writing a program in bash shell in UNIX

I have to write a program that lets the user enter as many numbers as they want, and determines which is the largest, which is the smallest, what the sum is, and what the average of all the entered numbers is. Am I forced to use arrays to do this, or is there another way? If I have to use an array, can someone help me out with an example of how I should approach this? Thanks
You do not need an array. Just keep the largest and smallest numbers so far, the count of numbers, and the running sum. The average is simply sum/count.
To read the input, you can use read in a while loop, as sketched below.
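For instance, a minimal sketch of that approach (assumes one number per line; end the input with Ctrl-D; integers only):
#!/bin/sh
count=0 sum=0
while read -r n; do
    [ "$count" -eq 0 ] && min=$n max=$n
    sum=$((sum + n))
    count=$((count + 1))
    [ "$n" -lt "$min" ] && min=$n
    [ "$n" -gt "$max" ] && max=$n
done
[ "$count" -gt 0 ] && echo "Sum: $sum  Avg: $((sum / count))  Min: $min  Max: $max"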
A simple, straightforward attempt, with some issues:
#!/bin/bash
echo "Enter some numbers separated by spaces"
read numbers
# Initialise min and max with first number in sequence
for i in $numbers; do
    min=$i
    max=$i
    break
done
total=0
count=0
for i in $numbers; do
    total=$((total+i))
    count=$((count+1))
    if test $i -lt $min; then min=$i; fi
    if test $i -gt $max; then max=$i; fi
done
echo "Total: $total"
echo "Avg: $((total/count))"
echo "Min: $min"
echo "Max: $max"
This was also tested with /bin/sh, so you don't actually need bash, which is a much larger shell. Also note that this only works with integers, and that the average is truncated (not rounded).
For floating point, you could use bc (see the sketch at the end of this answer). But instead of dropping into a different interpreter multiple times, why not write it in something a bit better suited to the problem, such as python or perl? E.g. in python:
import sys

# read one line of whitespace-separated numbers from stdin
numbers = [float(x) for x in sys.stdin.readline().split()]
print("Sum: " + str(sum(numbers)))
print("Avg: " + str(sum(numbers) / len(numbers)))
print("Min: " + str(min(numbers)))
print("Max: " + str(max(numbers)))
You could embed it in bash using a here document; see this question: How to pipe a here-document through a command and capture the result into a variable?
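And for the bc route mentioned above, a sketch that keeps the interpreter hops down to one bc call per result (the sample numbers are hypothetical):
read -r -a nums <<< "1 2 3.5"
sum=$(IFS=+; echo "${nums[*]}" | bc -l)   # joins with + -> "1+2+3.5"
echo "Sum: $sum  Avg: $(echo "$sum / ${#nums[@]}" | bc -l)"
# Sum: 6.5  Avg: 2.16666666666666666666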
