I was doing an exercise on reading from a setup file in which every line specifies two words and a number. The number denotes how many words must appear between the two specified words. Another file, input.txt, has a block of text, and the program counts the occurrences in the input file that satisfy the constraint on each line of the setup file (i.e., two particular words a and b separated by exactly n words, where a, b and n are specified in the setup file).
So I've tried to do this as a shell script, but my implementation is probably highly inefficient. I used arrays to store the words from the setup file, and then did a linear search over the text file to find the words and so on. Here's a bit of the code, if it helps:
#!/bin/bash
j=0
count=0
m=0
flag=0
error=0
# Read the setup file: word a, word b, and the required gap n per line
while read line; do
    line=($line)
    a[j]=${line[0]}
    b[j]=${line[1]}
    num=${line[2]}
    c[j]=`expr $num + 0`
    j=`expr $j + 1`
done <input2.txt
# Scan the text file word by word against every setup entry
while read line2; do
    line2=($line2)
    for (( i=0; $i<=50; i++ )); do
        for (( m=0; $m<j; m++ )); do
            g=`expr $i + ${c[m]}`
            g=`expr $g + 1`
            if [ "${line2[i]}" == "${a[m]}" ] ; then
                # A period between the two words invalidates the match
                for (( k=$i; $k<$g; k++ )); do
                    if [[ "${line2[k]}" == *.* ]]; then
                        flag=1
                        break
                    fi
                done
                if [ "${b[m]}" == "${line2[g]}" ] ; then
                    if [ "$flag" == 1 ] ; then
                        error=`expr $error + 1`
                    fi
                    count=`expr $count + 1`
                fi
                flag=0
            fi
            if [ "${line2[i]}" == "${b[m]}" ] ; then
                for (( k=$i; $k<$g; k++ )); do
                    if [[ "${line2[k]}" == *.* ]]; then
                        flag=1
                        break
                    fi
                done
                if [ "${a[m]}" == "${line2[g]}" ] ; then
                    if [ "$flag" == 1 ] ; then
                        error=`expr $error + 1`
                    fi
                    count=`expr $count + 1`
                fi
                flag=0
            fi
        done
    done
done <input.txt
count=`expr $count - $error`
echo "| Count = $count |"
As you can see, this takes a lot of time.
I was thinking of a more efficient way to implement this, in C or C++ this time. What could be a possible alternative implementation, with efficiency in mind? I thought of hash tables, but could there be a better way?
I'd like to hear what everyone has to say on this.
Here's a fully working possibility. It is not 100% pure bash, since it uses (GNU) sed: I'm using sed to lowercase everything and to get rid of punctuation marks. Maybe you won't need that; adapt it to your needs.
#!/bin/bash

input=input.txt
setup=setup.txt

# The Check function
Check() {
    # $1 is word1
    # $2 is word2
    # $3 is number of words between word1 and word2
    nb=0
    # Get all positions of w1
    IFS=, read -a q <<< "${positions[$1]}"
    # Check, for each position, if word2 is at distance $3 from word1
    for i in "${q[@]}"; do
        [[ ${words[$i+$3+1]} = $2 ]] && ((++nb))
    done
    echo "$nb"
}

# Slurp input file in an array
words=( $(sed 's/[,.:!?]//g;s/\(.*\)/\L\1/' -- "$input") )

# For each word, specify its positions in file
declare -A positions
pos=0
for i in "${words[@]}"; do
    positions[$i]+=$((pos++)),
done

# Do it!
while read w1 w2 p; do
    # Check that w1 w2 are not empty
    [[ -n $w2 ]] || continue
    # Check that p is a number
    [[ $p =~ ^[[:digit:]]+$ ]] || continue
    n=$(Check "$w1" "$w2" "$p")
    [[ $w1 != $w2 ]] && (( n += $(Check "$w2" "$w1" "$p") ))
    echo "$w1 $w2 $p: $n"
done < <(sed 's/\(.*\)/\L\1/' -- "$setup")
How does it work:
we first read the whole file input.txt into the array words, one word per field. Observe I'm using sed here to delete all punctuation marks (well, only the characters , . : ! ?, for testing purposes; add some more if you wish) and to lowercase every letter.
Loop through the array words and for each word, put its position in an associative array positions:
w => "position1,position2,...,positionk,"
Finally, we read the setup.txt file (filtered through sed again to lowercase everything; optional, see below). We do a quick check that the line is valid (two words and a number) and then call the Check function (twice, once for each order of the given words, unless both words are equal).
The Check function finds all positions of word1 in the file, thanks to the associative array positions, and then, using the array words, checks whether word2 is at the given "distance" from word1.
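For a concrete (hypothetical) illustration of these two structures: if input.txt contains the quick fox likes the slow fox, then after preparation we have
words=(the quick fox likes the slow fox)
positions[the]="0,4," positions[quick]="1," positions[fox]="2,6," ...
and Check the fox 1 tests words[0+1+1] and words[4+1+1], both of which are fox, so it prints 2.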
The second sed is optional. I've filtered the setup.txt file through sed to lowercase everything. This sed adds only very little overhead, so, efficiency-wise, it's not a big deal. You'll be able to add more filtering later to make sure the data is consistent with how the script will use it (e.g., get rid of punctuation marks). Otherwise you could:
Get rid of it altogether: replace the corresponding line (the last line) with just
done < "$setup"
In this case, you'll have to trust the guy/gal who will write the setup.txt file.
Get rid of it as above, but still convert everything to lowercase. In this case, below the
while read w1 w2 p; do
line, just add these lines:
w1=${w1,,}
w2=${w2,,}
That's the bash way to lowercase a string.
Caveats. The script will break if:
The number given in the setup.txt file starts with a 0 and contains an 8 or a 9. This is because bash will treat it as an octal number, in which 8's and 9's are not valid. There are workarounds for this.
The text in input.txt doesn't follow proper typographical practice, namely that a punctuation mark is always followed by a space. E.g., if the input file contains
The quick,brown,dog jumps over the lazy fox
then after the sed treatment the text will look like
The quickbrowndog jumps over the lazy fox
and the words quick, brown and dog won't be treated properly. You can replace the sed substitution s/[,.:!?]//g with s/[,.:!?]/ /g to convert these symbols to spaces. It's up to you, but in that case abbreviations such as e.g. and i.e. might not be handled properly… it really depends on what you need to do.
Different character encodings are used… I don't really know how robust you need the script to be, and what languages and encodings you'll consider.
(Add stuff here :).)
About efficiency. I'd say the algorithm is rather efficient. bash is probably not the best-suited language for this, but it's a lot of fun, and not that difficult after all if we look at it (fewer than 20 lines of relevant code). If you only have 50 files with 50,000 words each, it's OK: you will not notice much difference between bash and perl/python/awk/C/you-name-it, as bash performs decently quickly on files of this size. Now, if you have 100,000 files each containing millions of words, a different approach should be taken and a different language should be used (but I don't know which one).
If:
it can get complex for the sake of efficiency
the text file can be large
the setup file can have many rows
then I would do it the following way:
As preparation I would create:
A hash map with the index of the word as key and the word as the value (named, say, WORDS). So WORDS[1] would be the first word, WORDS[2] the second, and so on.
A hash map with the words as keys and the list of indexes as values (named, say, INDEXES). So if WORDS[2] and WORDS[5] are "dog" and no others, then INDEXES["dog"] would yield the numbers 2 and 5. The value can be a dynamic indexed array or a linked list. A linked list is better if there are words that occur many times.
You can read the text file, and populate both structures at the same time.
Processing:
For each row of the setup file I would get the indexes in INDEXES[firstword] and check whether WORDS[index + wordsinbetween + 1] equals secondword. If it does, that's a hit.
Notes:
Preparation: You only read the text file once. For each word in the text file, you're doing fast operations whose performance is not really affected by the number of words already processed.
Processing: You only read the setup file once. For each row, you're likewise doing operations whose cost depends only on the number of occurrences of firstword in the text file.
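A minimal bash sketch of this approach might look as follows (the file names and the report format are assumptions; WORDS is a plain indexed array, since its keys are consecutive integers, and INDEXES stores each index list as a space-separated string):
#!/bin/bash
# Preparation: WORDS maps index -> word, INDEXES maps word -> list of indexes
declare -a WORDS=()
declare -A INDEXES=()
i=1
for w in $(cat input.txt); do
    WORDS[i]=$w
    INDEXES[$w]+="$i "
    ((i++))
done
# Processing: for each setup row, a hit is WORDS[index + n + 1] == secondword
while read -r first second n; do
    hits=0
    for idx in ${INDEXES[$first]}; do
        [[ ${WORDS[idx + n + 1]} == "$second" ]] && ((hits++))
    done
    echo "$first $second $n: $hits"
done < setup.txt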
I'm unable to find any for-loop syntax that works for me in ksh. I want to create a program that essentially adds up the numbers that are assigned to letters when a user inputs a word. I want the for loop to read the first character, grab that letter's number value, and add the value to a variable that starts out equal to 0. Then the next letter's value gets added to the variable, and so on, until there are no more letters and the variable holds the final value.
I understand I more than likely need an array that specifies what each letter's (a-z) value would be (1-26), which I am finding difficult to figure out as well. Or, worst case, I figure out the for loop and then write about 26 if statements saying something like: if the letter equals c, add 3 to the variable.
So far I have this (which is pretty bare bones):
#!/bin/ksh
typeset -A my_array
my_array[a]=1
my_array[b]=2
my_array[c]=3
echo "Enter a word: \c"
read work
for (( i=0; i<${work}; i++ )); do
echo "${work:$i:1}"
done
Pretty sure this for loop is bash and not ksh. And the array returns an error of typeset: bad option(s). (I understand I haven't used the array in the for loop yet.)
I want the array letters (a, b, c, etc) to correspond to a value such as a = 1, b = 2, c = 3 and so on. For example the word is 'abc' and so it would add 1 + 2 + 3 and the final output will be 6.
You were missing the pound sign in ${#work}, which expands to the length of ${work}.
#!/bin/ksh
typeset -A my_array
my_array[a]=1
my_array[b]=2
my_array[c]=3
read 'work?enter a word: '
for (( i=0; i<${#work}; i++ )); do
    c=${work:$i:1} w=${my_array[${work:$i:1}]}
    echo "$c: $w"
done
ksh2020 supports read -p PROMPT syntax, which would make this script 100% compatible with bash. ksh93 does not. You could also use printf 'enter a word: ', which works in all three (ksh2020, ksh93, bash) and zsh.
Both ksh2020 and ksh93 understand read var?prompt which I used above.
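To get the total the question is after, a small extension might look like this (a sketch only; the letters string and the :-0 default for non-letters are my additions):
#!/bin/ksh
typeset -A my_array
letters=abcdefghijklmnopqrstuvwxyz
for (( i=0; i<${#letters}; i++ )); do
    my_array[${letters:$i:1}]=$((i+1))
done
typeset -i total=0
read 'work?enter a word: '
for (( i=0; i<${#work}; i++ )); do
    c=${work:$i:1}
    (( total += ${my_array[$c]:-0} ))    # unknown characters count as 0
done
echo "total: $total"    # 'abc' gives 1 + 2 + 3 = 6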
First check the input work: you only want the characters {a..z}.
charwork=$(tr -cd 'a-z' <<< "$work")
Next you can fill 2 arrays with corresponding values
a=( {a..z} )
b=( {1..26} )
Using these arrays you can make a file with sed commands
for ((i=0;i<26;i++)); do echo "s/${a[i]}/${b[i]}+/g"; done > replaceletters.sed
# test it
echo "abcz" | sed -f replaceletters.sed
# result: 1+2+3+26+
Before you can pipe this into bc, use sed to remove the trailing + character: append s/+$/\n/ to replaceletters.sed and bc can calculate it.
With that, sed does the whole job of replacing letters with digits and inserting + signs.
Combining the steps, you have
a=( {a..z} )
b=( {1..26} )
tr -cd 'a-z' <<< "$work" |
    sed -f <(
        for ((i=0;i<26;i++)); do
            echo "s/${a[i]}/${b[i]}+/g"
        done
        echo 's/+$/\n/'
    ) | bc
In the loop you can use $((i+1)) and avoid the array b, but remember that arrays start at index 0, so ${a[5]} corresponds to the value 6.
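For instance, the generating loop without the b array would be (producing the same replaceletters.sed as before):
for ((i=0;i<26;i++)); do echo "s/${a[i]}/$((i+1))+/g"; done > replaceletters.sed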
I wrote a bash script that takes a file name as its first argument, $1, and needs to read that file line by line within a loop. Based on a condition tested in each iteration, each line from the file will feed into one of two new arrays, let's say named GOOD and BAD. Lastly, I'll display the total elements of each array.
#!/bin/bash
for x in $(cat $1); do
    #testing something on x
    if [ $? -eq 0 ]; then
        #add the current value of x into array called GOOD
    else
        #add the current value of x into array called BAD
    fi
done
echo "Total GOOD elements: ${#GOOD[@]}"
echo "Total BAD elements: ${#BAD[@]}"
What changes should I make to accomplish this?
#!/usr/bin/env bash

# here, we're checking the number of lines more than 5 characters long
# replace with your real test
testMyLine() { (( ${#1} > 5 )); }

good=( ); bad=( )
while IFS= read -r line; do
    if testMyLine "$line"; then
        good+=( "$line" )
    else
        bad+=( "$line" )
    fi
done <"$1"

echo "Read ${#good[@]} good and ${#bad[@]} bad lines"
Note:
We're using a while read loop to iterate over file contents. This doesn't need to read more than one line into memory at a time (so it won't run out of RAM even with really big files), and it doesn't have unwanted side effects like changing a line containing * to a list of files in the current directory.
We aren't using $?. if foo; then is a much better way to branch on the exit status of foo than foo; if [ $? = 0 ]; then -- in particular, this avoids depending on the value of $? not being changed between when you assign it and when you need it; and it marks foo as "checked", to avoid exiting via set -e or triggering an ERR trap when your boolean returns false.
The use of lower-case variable names is intentional. All-uppercase names are used for shell-builtin variables and names with special meaning to the operating system -- and since defining a regular shell variable overwrites any environment variable with the same name, this convention applies to both types. See http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html
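For example, with the length test above (assuming the script is saved as classify.sh):
$ printf '%s\n' short 'a longer line' tiny > sample.txt
$ ./classify.sh sample.txt
Read 1 good and 2 bad lines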
I have a text file that has 8000 lines, here is an example
00122;IL;Chicago;Router;;1496009459
00133;IL;Chicago;Router;0;6.651;1496009460
00166;IL;Chicago;Router;0;5.798;1496009460
00177;IL;Chicago;Router;0;5.365;1496009460
00188;IL;Chicago;Router;0;22.347;1496009460
As you can see, the lines have different counts of the delimiter. I need to insert all columns separated by ';' into an array, no matter how many delimiters occur.
So the first line would have 6 fields and the second line would have 7.
When I tried to do it with the command below
Number=( $(awk '{print $1}' $FileName.txt) )
(with a different array name and field number for each column), I got strange behavior: not all fields are printed for some lines when I echo them all on one line.
Performance is very important (it needs to run in a matter of seconds), and I found using awk to be the fastest approach so far, unless someone has a better one.
Any ideas why this is happening?
To dump the entire text file into an array, I would use the following. In this example, we use the two arrays ${finalarray[]} and ${subarray[]} (though the latter is unset at the end) and the variable $line. We assume the file name is file.txt.
#!/bin/bash
finalarray=()
while read -r line; do                      # each cycle, $line is the next line of the file
    if [[ -z $line ]]; then continue; fi    # if the line is empty, skip this cycle
    IFS=";" read -r -a subarray <<< "$line" # split $line into ${subarray[]} using ; as delim
    finalarray+=( "${subarray[@]}" )        # add every element of ${subarray[]} to ${finalarray[]}
    unset subarray                          # clear the temporary array
done <file.txt
If your empty lines are in fact populated by spaces or other whitespace characters, the empty-line catch won't work. Instead, you could use something like the following to skip any line not containing a semicolon.
if [[ $(echo "$line" | grep -c ";") -eq 0 ]]; then continue; fi
On the other hand, this would skip all lines without a semicolon, even if you intended some of those lines to be a single array entry.
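To convince yourself that the split preserves empty fields, you can test a single sample line from the file (a quick illustrative check):
line='00122;IL;Chicago;Router;;1496009459'
IFS=';' read -r -a subarray <<< "$line"
declare -p subarray    # shows all 6 fields, including the empty fifth one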
This question already has answers here:
A variable modified inside a while loop is not remembered
(8 answers)
Closed 6 years ago.
I have a pretty simple sh script where I make a call to cat, collect the results, and parse some relevant information before storing it in an array, which seems to work just fine. But as soon as I exit the loop where I store the information, the array seems to clear itself. I'm wondering if I am accessing the array incorrectly outside of the loop. Relevant portion of my script:
#!/bin/sh
declare -a QSPI_ARRAY=()
cat /proc/mtd | while read mtd_instance
do
    # split result into individual words
    words=($mtd_instance)
    for word in "${words[@]}"
    do
        # check for uboot
        if [[ $word == *"uboot"* ]]
        then
            mtd_num=${words[0]}
            index=${mtd_num//[!0-9]/} # strip everything except the integers
            QSPI_ARRAY[$index]="uboot"
            echo "QSPI_ARRAY[] at index $index: ${QSPI_ARRAY[$index]}"
        elif [[ $word == *"fpga_a"* ]]
        then
            echo "found it: "$word""
            mtd_num=${words[0]}
            index=${mtd_num//[!0-9]/} # strip everything except the integers
            QSPI_ARRAY[$index]="fpga_a"
            echo "QSPI_ARRAY[] at index $index: ${QSPI_ARRAY[$index]}"
        # other items are added to the array, all successfully
        fi
    done
    echo "length of array: ${#QSPI_ARRAY[@]}"
    echo "----------------------"
done
My output is great until I exit the for loop. While within the for loop, the array size increments and I can check that the item has been added. After the for loop is complete I check the array like so:
echo "RESULTING ARRAY:"
echo "length of array: ${#QSPI_ARRAY[#]}"
for qspi in "${QSPI_ARRAY}"
do
echo "qspi instance: $qspi"
done
Here are my results, echoed to my display:
dev: size erasesize name
length of array: 0
-------------
mtd0: 00100000 00001000 "qspi-fsbl-uboot"
QSPI_ARRAY[] at index 0: uboot
length of array: 1
-------------
mtd1: 00500000 00001000 "qspi-fpga_a"
QSPI_ARRAY[] at index 1: fpga_a
length of array: 2
-------------
RESULTING ARRAY:
length of array: 0
qspi instance:
EDIT: After some debugging, it seems I somehow have two different arrays here. I initialized the array like so: QSPI_ARRAY=("a" "b" "c" "d" "e" "f" "g"), and after my for loop for parsing, it is still a, b, c, etc. How do I have two different arrays of the same name here?
This structure:
cat /proc/mtd | while read mtd_instance
do
...
done
means that whatever comes between do and done cannot have any effect on the shell environment that is still there after the done.
The fact that the while loop is on the right hand side of a pipe (|) means that it runs in a subshell. Once the loop exits, so does the subshell. And all of its variable settings.
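A minimal way to see this effect (a three-line demonstration, not taken from the question):
n=0
printf '%s\n' a b c | while read -r line; do n=$((n+1)); done
echo "$n"    # prints 0: the increments happened in a subshell and were lost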
If you want a while loop which makes changes that stick around, don't use a pipe. Input redirection doesn't create a subshell, and in this case, you can just read from the file directly:
while read mtd_instance
do
...
done </proc/mtd
If you had a more complicated command than a cat, you might need to use process substitution. Still using cat as an example, that looks like this:
while read mtd_instance
do
...
done < <(cat /proc/mtd)
In the specific case of your example code, I think you could simplify it somewhat, perhaps like this:
#!/usr/bin/env bash
QSPI_ARRAY=()
while read -a words; do
    declare -i mtd_num=${words[0]//[!0-9]/}
    for word in "${words[@]}"; do
        for type in uboot fpga_a; do
            if [[ $word == *$type* ]]; then
                QSPI_ARRAY[mtd_num]=$type
                break 2
            fi
        done
    done
done </proc/mtd
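If the loop works as intended on the /proc/mtd contents shown in the question, a quick check afterwards would be (hypothetical output):
declare -p QSPI_ARRAY
# declare -a QSPI_ARRAY=([0]="uboot" [1]="fpga_a")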
Is this potentially what you are seeing:
http://mywiki.wooledge.org/BashFAQ/024
I'm struggling with a project. I am supposed to write a bash script which will work like the tr command. At the beginning I would like to save all the command's arguments into separate arrays, and in case an argument is a word, I would like to have each char in a separate array field, e.g.:
tr_mine AB DC
I would like to have two arrays: a[0] = A, a[1] = B and b[0]=C b[1]=D.
I found a way, but it's not working:
IFS="" read -r -a array <<< "$a"
No sed, no awk, all bash internals.
Assuming that words are always separated by blanks (spaces and/or tabs), also assuming that words are given as arguments, and writing for bash only:
#!/bin/bash
blank=$'[ \t]'
varname='A'
n=1
while IFS='' read -r -d '' -N 1 c ; do
    if [[ $c =~ $blank ]]; then n=$((n+1)); continue; fi
    eval ${varname}${n}'+=("'"$c"'")'
done <<<"$@"
last=$(eval echo \${#${varname}${n}[@]})   ### Find last character index.
unset "${varname}${n}[$last-1]"            ### Remove last (trailing) newline.
for ((j=1;j<=$n;j++)); do
    k="A$j[@]"
    printf '<%s> ' "${!k}"; echo
done
That will set each array A1, A2, A3, etc. ... to the letters of each word.
The value of $n at the end of the first loop is the count of words processed.
Printing may be a little tricky, that is why the code to access each letter is given above.
Applied to your sample text:
$ script.sh AB DC
<A> <B>
<D> <C>
The script is setting two (array) vars A1 and A2.
And each letter is one array element: A1[0] = A, A1[1] = B and A2[0]=C, A2[1]=D.
You need to set a variable ($k) to the array element to access.
For example, to echo fourth letter (0 based) of second word (1 based) you need to do (that may be changed if needed):
k="A2[3]"; echo "${!k}" ### Indirect addressing.
The script will work as this:
$ script.sh ABCD efghi
<A> <B> <C> <D>
<e> <f> <g> <h> <i>
Caveat: Characters will be split even if quoted. However, quoted arguments are the correct way to use this script, to avoid the effect of shell metacharacters ( | & ; ( ) < > space tab ). Of course, spaces (even if repeated) will split words, as defined by the variable $blank:
$ script.sh $'qwer;rttt fgf\ngfg'
<q> <w> <e> <r> <;> <r> <t> <t> <t>
<>
<>
<>
<f> <g> <f> <
> <g> <f> <g>
As the script will accept and correctly process embedded newlines, we need the unset "${varname}${n}[$last-1]" to remove the last trailing newline. If that is not desired, comment out that line.
Security note: The eval is not much of a problem here, as it only processes one character at a time. It would be difficult to create an attack based on just one character. Anyway, the usual warning is valid: always sanitize your input before using this script. Also, most (unquoted) bash metacharacters will break this script:
$ script.sh qwer(rttt fgfgfg
bash: syntax error near unexpected token `('
I would strongly suggest doing this in another language if possible; it will be a lot easier.
Now, the closest I've come up with is:
#!/bin/bash
sentence="AC DC"
words=`echo "$sentence" | tr " " "\n"`
# final array
declare -A result
# word count
wc=0
for i in $words; do
    # letter count in the word
    lc=0
    for l in `echo "$i" | grep -o .`; do
        result["w$wc-l$lc"]=$l
        lc=$(($lc+1))
    done
    wc=$(($wc+1))
done
rLen=${#result[@]}
echo "Result Length $rLen"
for i in "${!result[@]}"
do
    echo "$i => ${result[$i]}"
done
The above prints:
Result Length 4
w1-l1 => C
w1-l0 => D
w0-l0 => A
w0-l1 => C
Explanation:
Dynamic variables are not supported in bash (i.e., creating variables whose names are built from other variables), so I am using an associative array instead (result).
Arrays in bash are one-dimensional. To fake a 2D array I use composite indexes: w for words and l for letters. This will make further processing a pain...
Associative arrays are not ordered, thus results appear in arbitrary order when printing.
${!result[@]} is used instead of ${result[@]}: the first iterates over keys while the second iterates over values.
I know this is not exactly what you asked for, but I hope it points you in the right direction.
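For example, to fetch letter 1 of word 0 under this key scheme (prints C for the sample sentence above):
echo "${result[w0-l1]}"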
Try this:
sentence="$@"
read -r -a words <<< "$sentence"
for word in ${words[@]}; do
    inc=$(( i++ ))
    read -r -a l${inc} <<< $(sed 's/./& /g' <<< $word)
done
echo ${words[1]} # prints "DC"
echo ${l1[1]} # prints "C"
The first read reads all the words; the one inside the loop handles the letters.
The sed command adds a space after each letter to make the string splittable by read -a. You can also use this sed command to remove unwanted characters from words (e.g. commas) before splitting.
If special characters are allowed in words, you can use a simple grep instead of the sed command (as suggested in http://www.unixcl.com/2009/07/split-string-to-characters-in-bash.html) :
read -r -a l${inc} <<< $(grep -o . <<< $word)
The word array is words.
The letter arrays are named l0, l1, l2, ..., with the number incremented for each word read.
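To inspect the resulting arrays, you can append declare -p words l0 l1 to the snippet (a hypothetical session; tr_mine.sh is an assumed file name):
$ ./tr_mine.sh AB DC
DC
C
declare -a words=([0]="AB" [1]="DC")
declare -a l0=([0]="A" [1]="B")
declare -a l1=([0]="D" [1]="C")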