awk through text file with different delimiter count into array

I have a text file that has 8000 lines, here is an example
00122;IL;Chicago;Router;;1496009459
00133;IL;Chicago;Router;0;6.651;1496009460
00166;IL;Chicago;Router;0;5.798;1496009460
00177;IL;Chicago;Router;0;5.365;1496009460
00188;IL;Chicago;Router;0;22.347;1496009460
As you can see, the lines have different delimiter counts. I need to insert all columns separated by ';' into an array, no matter how many delimiters each line has.
So the first line would have 6 fields and the second line would have 7.
When I tried to do it with the command below
Number=( $(awk '{print $1}' $FileName.txt) ) (with a different array name and field number for each column), I get strange behavior: not all fields are printed for some lines when I echo them all on one line.
Performance is very important (it needs to finish in a matter of seconds), and awk is the fastest approach I have found so far, unless someone has a better one.
Any ideas why this is happening?

To dump the entire text file into an array, I would use the following. In this example, we use the two arrays ${finalarray[]} and ${subarray[]} (though the latter is unset at the end of each iteration) and the variable $line. We assume the file name is file.txt.
#!/bin/bash
finalarray=()
while read -r line; do #For each cycle, the variable $line is the next line of the file
if [[ -z $line ]]; then continue; fi #If the line is empty, skip this cycle
IFS=";" read -r -a subarray <<< "$line" #Split $line into ${subarray[]} using ; as delim
finalarray+=( "${subarray[@]}" ) #Add every element from ${subarray[]} to ${finalarray[]}
unset subarray #Clear the temporary array
done <file.txt
If your empty lines are, in fact, populated by spaces or other whitespace characters, the empty line catch won't work. Instead, you could use something like the following to skip any lines not containing semicolons.
if [[ $(echo "$line" | grep -c ";") -eq 0 ]]; then continue; fi
On the other hand, this would skip all lines without a semicolon, even if you intended some of those lines to be a single array entry.
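Since the question stresses performance and already reaches for awk, here is a minimal sketch of a per-column variant, assuming bash 4's mapfile and that one array per column is the goal. mapfile avoids the word splitting that an unquoted $( ... ) expansion performs, which is a plausible cause of the "strange behavior" when fields contain spaces. The FileName value is hypothetical, mirroring the question's $FileName.
#!/bin/bash
FileName="file"   # hypothetical name, matching the question's $FileName
mapfile -t Number < <(awk -F';' '{print $1}' "$FileName.txt")
mapfile -t State  < <(awk -F';' '{print $2}' "$FileName.txt")
echo "${Number[0]} ${State[0]}"   # e.g. 00122 IL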

Related

Bash - Looping through the content of file and perform actions on rows and columns

The structure of my input files is as follows:
<string1> <string2> <stringN>
hello nice world
one three
NOTE: the second row has a tab/null in the second column, so the second column of the second row is empty rather than 'three'.
In bash, I want to loop through each row and also be able to process each individual column (string[1-N])
I am able to iterate to each row:
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
line=${line/$/'\t'/,}
read -r -a columns <<< "$line"
echo "current Row: $line"
echo "column[1]: '${columns[1]}'"
#echo "column[N] '${columns[N]}'"
done < "${1}"
Expected result:
current Row: hello,nice,world
column[1]: 'nice'
current Row: one,,three
column[1]: ''
Basically what I do is iterate through the input file (passed here as an argument) and do all the "cleaning": prevent whitespace from being trimmed, ignore backslashes, and also handle the last line.
Then I replace the tabs '\t' with commas,
and finally read the line into an array (columns) to be able to select a particular column.
The input file uses tabs as the separator, so I tried to convert it to CSV format. I am not sure if the substitution I use is correct in bash, or if something else is wrong, because this does not return a value in the array.
Thanks
You are almost there; you need a little fix in the translation of '\t' to commas, and you also have to set IFS to the comma.
try this:
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
line=${line//$'\t'/,}
IFS=',' read -r -a columns <<< "$line"
#echo "current Row: $line"
echo "column[0]:'${columns[0]}' column[1]:'${columns[1]}' column[2]:'${columns[2]}'"
done < "${1}"
run:
$> <the_script> <the_file>
Outputs:
column[0]:'hello' column[1]:'nice' column[2]:'world '
column[0]:'one' column[1]:'' column[2]:'three'
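A side note on the design choice, with a minimal sketch: the translation to commas is not just cosmetic. Tab is IFS whitespace, so runs of tabs collapse into a single delimiter and the empty second column would be lost; a comma, being non-whitespace, preserves it:
line=$'one\t\tthree'
IFS=$'\t' read -r -a t <<< "$line"            # tabs collapse: 2 fields
IFS=','  read -r -a c <<< "${line//$'\t'/,}"  # commas do not: 3 fields
echo "${#t[@]} vs ${#c[@]}"                   # prints: 2 vs 3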

Bash: Convert text string into array with multiple \r\n as field separator

I have a windows text file in the format:
line\r\n
line\r\n
line\r\n
\r\n
line\r\n
line\r\n
line\r\n
\r\n
...
I want to put this text file into an array where the field separator is \r\n\r\n. I searched for an answer, but nothing I found and tried worked. awk, for example, is too complex for me, and FS= did not work as I expected.
Commands to read arrays in bash can (as far as I know) only use single characters as a field separator, not complete strings like \r\n\r\n.
Workaround
First replace the field separator \r\n\r\n with a single character which is not used in the string to be split. I found \x1e (the ASCII control character »Record Separator«) to work out quite well.
Then read the array using the new (one character) field separator.
The field separator will always be removed when reading something to an array. But you can append the separator to each field.
Here is a pure bash solution to read the file file into the array array:
IFS=$'\x1e'
filecontent="$(< file)"
array=(${filecontent//$'\r\n\r\n'/$'\x1e'})
array=("${array[#]/%/$'\r\n\r\n'}")
IFS=$'\x1e' sets bash's field separator which is used to split strings into arrays. Depending on your script you may want to restore the old IFS afterwards (default is IFS=$' \t\n').
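If the rest of the script relies on the default IFS, here is a minimal sketch of the save-and-restore mentioned above:
oldIFS=$IFS        # remember the current value
IFS=$'\x1e'
array=(${filecontent//$'\r\n\r\n'/$'\x1e'})
IFS=$oldIFS        # restore it once the split is done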
Results
For file
A B C\r\n
D E F\r\n
\r\n
G H I\r\n
\r\n
the resulting array will have two entries:
${array[0]}
A B C\r\n
D E F\r\n
\r\n
${array[1]}
G H I\r\n
\r\n
Known Problems
IFS at the beginning and end of the string will be trimmed. Repeated IFS will be squeezed. The file \r\n\r\n will result in an array without entries. Empty entries cannot be created.
\r\n\r\n is appended to all entries in all cases. The file A\r\n\r\nB will result in an array with the two entries A\r\n\r\n and B\r\n\r\n.
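If the unconditionally appended separator gets in the way later, the same %-anchored pattern substitution can strip it again; a minimal sketch:
array=("${array[@]/%$'\r\n\r\n'/}")   # remove one trailing \r\n\r\n from each entry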
In Linux all lines of files are terminated with \n.
So your problem is not the \r\n; it is just the \r. Just remove it:
$ tr -d '\r' <file >newfile
To verify that \r is removed you can do:
$ head -n2 newfile |od -t x1c
This will get the first two lines of the new file, and the od tool will dump those lines as ASCII hex codes. In ASCII hex, \r is \x0d and \n is \x0a.
Once you have removed the \r from your file you can do anything you want.
You can use all linux tools (including awk) straight forward without special settings.
To build an array you can use:
$ while read -r line;do data+=("$line");done <newfile
If you want to skip blank lines, this one is enough:
$ while read -r line;do [[ "$line" == "" ]] && continue;data+=("$line") ;done <file1
You can of course combine array creation with on-the-fly removal of the \r, without modifying your existing file, like this:
while read -r line;do [[ "$line" == "" ]] && continue;data+=("$line") ;done < <(tr -d '\r' <file1)
To see what is inside array "data" just use $ declare -p data
PS: By the way, using awk -v RS='\r\n' '{your awk code here}' should be enough to read the initial file directly in awk as well. RS = Record Separator.
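A minimal sketch of that PS, extended to do the \r\n\r\n splitting directly, assuming GNU awk (POSIX awk only honors the first character of RS, while gawk treats a multi-character RS as a regex); here each block's inner line breaks are joined with '--':
gawk -v RS='\r\n\r\n' 'NF { gsub(/\r\n/, "--"); print NR ": " $0 }' file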
I made this script in pure bash, even though the answer from socowi is pure bash too:
exec < filern.txt
declare -a array
acc=""
lineno=0
cr=$(echo -en "\r")
while read line; do
line=${line%$cr}
if [ -z "$line" ]; then
let lineno=$lineno+1
array[$lineno]=$acc
acc=""
else
[ ! -z "$acc" ] && acc="$acc--" # you can use any separator here
acc="$acc$line"
fi
done
echo "Read file in array:"
for ((i=1; i<=${#array[@]}; i++)); do
printf "%3.3d |%s|\n" $i "${array[$i]}"
done
It reads a "real" line of input at a time, and strips the trailing \r.
At this point, a sequence \r\n\r\n turns into an empty line, so that is used to assign the array elements one after the other.
The output from the example file is:
Read file in array:
001 |line--line--line|
002 |line--line--line|
The separator could also be a \r, or whatever. I couldn't find a way to clear the trailing \r with the command line=${line% ?? }, so I used a variable. The same trick can be used to add a "strange" separator to the variable $acc. I hope it helps.

Splitting a list of words from a text file into an array in BASH

I've been fighting with this for the past hour or so. I'm trying to get a text file filled with names into an array of those words.
The text file is formatted:
Word1~~Word2
Word1~~Word2
..
Eventually I want to split these words into two arrays of word1 and word2, breaking at the "~~", but that is a problem for later.
Right now I (currently) have this:
#!/bin/bash
a=$(cat ~/words.txt)
c=0
for word in $a
do
arrayone[$c]=(echo $word)
c=$((c+1))
done
I've tried many other ways, and they all either don't work or error out upon execution. I'm relatively new to bash and am having extreme difficulty with the syntax.
Thank you for your time
It's actually just as easy to solve your "later" problem right now. Unless you need to be able to deal with words with (unpaired) literal ~ characters, this will do:
declare -a arr1=() arr2=()
while IFS='~' read -r word1 _ word2; do
arr1+=( "$word1" )
arr2+=( "$word2" )
done <file
printf 'Words in column 1:\n'
printf ' %s\n' "${arr1[#]}"
printf 'Words in column 2:\n'
printf ' %s\n' "${arr2[#]}"
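For the two sample lines from the question, this prints:
Words in column 1:
 Word1
 Word1
Words in column 2:
 Word2
 Word2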
If you do need to deal with the more interesting case, treating only a double tilde as special, one approach to doing so is with a regex match:
while IFS='' read -r line; do
if [[ $line =~ (.*)[~][~](.*) ]]; then
arr1+=( "${BASH_REMATCH[1]}" )
arr2+=( "${BASH_REMATCH[2]}" )
else
printf 'Line does not match pattern: %s\n' "$line" >&2
fi
done <file

Check if each element of an array is present in a string in bash, ignoring certain characters and order

On the web I found answers for checking whether an element of an array is present in a string. But I want to check whether every element of the array is present in the string.
eg. str1 = "This_is_a_big_sentence"
Initially str2 was like
str2 = "Sentence_This_big"
Now I want to search whether str1 contains "sentence" & "this" & "big" (all 3, ignoring alphabetical order and case).
So I used arr=(${str2//_/ })
How do I proceed now? I know the comm command finds intersections, but it needs sorted lists, and I also need to ignore the underscores.
I get my str2 by finding the extension of a particular type of file using the command
for i in `ls snooze.*`; do echo $i | cut -d "." -f2
# Till here I get str2 and need to check as mentioned above. Not sure how to do this. I tried putting str2 into an array, and now I just need to check whether all elements of my array occur in str1 (ignoring case and order).
Any help would be highly appreciated.
Now I wanted to search if string str1 contains "sentence" & "this" & "big"
(all 3, ignore alphabetical order and case)
Here is one approach:
#!/bin/bash
str1="This_is_a_big_sentence"
str2="Sentence_This_big"
if ! grep -qvwFf <(sed 's/_/\n/g' <<<${str1,,}) <(sed 's/_/\n/g' <<<${str2,,})
then
echo "All words present"
else
echo "Some words missing"
fi
How it works
${str1,,} returns the string str1 with all capitals replaced by lower case.
sed 's/_/\n/g' <<<${str1,,} returns the string str1, all converted to lower case and with underscores replaced by newlines, so that each word is on its own line.
<(sed 's/_/\n/g' <<<${str1,,}) returns a file-like object containing all the words in str1, each word lower case and on a separate line.
The creation of file-like objects is called process substitution. It allows us, in this case, to treat the output of a shell command as if it were a file to read.
<(sed 's/_/\n/g' <<<${str2,,}) does the same for str2.
Assuming that file1 and file2 each have one word per line, grep -vwFf file1 file2 removes from file2 every word that occurs in file1. If there are no words left, that means that every word in file2 appears in file1.
By adding the option -q, grep will return no output but will set an exit code that we can use in our if statement.
In the actual command, file1 and file2 are replaced by our file-like objects.
The remaining grep options can be understood as follows:
-w tells grep to look for whole words only.
-F tells grep to look for fixed strings, not regular expressions.
-f tells grep to look for the patterns to match in the file (or file-like object) which follows.
-v tells grep to remove (the default is to keep) the words which match.
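To see what each process substitution feeds to grep, run one by itself; each side is just one lowercase word per line:
$ sed 's/_/\n/g' <<< "${str1,,}"
this
is
a
big
sentence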
Here is an awk solution to check existence of all the words from a string in another string:
str1="This_is_a_big_sentence"
str2="Sentence_This_big"
awk -v RS=_ 'FNR==NR{a[tolower($1)]; next} {delete a[tolower($1)]} END{print (length(a)) ? "Not all words" : "All words"}' <(echo "$str2") <(echo "$str1")
With indentation:
awk -v RS=_ 'FNR==NR {
a[tolower($1)];
next
}
{ delete a[tolower($1)] }
END {
print (length(a)) ? "Not all words" : "All words"
}' <(echo "$str2") <(echo "$str1")
Explanation:
-v RS=_ - Set the record separator to _
FNR==NR - Execute this block for str2
a[tolower($1)]; next - Populate an array a with each lowercase word as key
{delete a[tolower($1)]} - For each word in str1 delete key in array a
END - If the length of array a is still not 0, then some words of str2 were not found in str1.
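For the example strings, every word of str2 occurs in str1, so the array is empty at END and the command prints:
All words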
Here's another solution:
#!/bin/bash
str1="This_is_a_big_sentence"
str2="sentence_This_big"
var=0
var2=0
while read in
do
if [ -n "$(echo "$str1" | grep -ioE "$in")" ]
then
var=$((var+1))
fi
var2=$((var2+1))
done < <(echo $str2 | sed -e 's/\(.*\)/\L\1/' -e 's/_/\n/g')
if [[ $var -eq $var2 && $var -ne 0 ]]
then
echo "matched"
else
echo "not matched"
What this script does is make str2 all lowercase with sed -e 's/\(.*\)/\L\1/', a substitution that replaces every character with its lowercase form, and then replace the underscores _ with newlines \n using the second sed expression, 's/_/\n/g'.
The individual words are then fed into a while loop that greps str1 for each word that was fed in. Every time there's a match, we increment var, and on every iteration of the while loop we increment var2. If var == var2, then all the words of str2 were found in str1. Hope that helps.
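With the given strings, all three words of str2 match, so var and var2 both end up at 3 and the script prints:
matched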
Here's an approach.
if [ "$(echo "This_BIG_senTence" | grep -ioE 'this|big|sentence' | wc -l)" == "3" ]; then echo "matched"; fi
How it works.
The grep option -i makes the match case-insensitive, -E enables extended regular expressions, and -o puts each match on its own line. Now that the matches are separated by line, use wc with -l for the line count. Since we had 3 conditions, we check whether the count equals 3. Grep prints one line per match, so if you are only working with a single string, the example above will print one line for each condition, in this case 3, so there won't be any problems.
Note you can also create a grep chain and see if it's empty.
if [ $(echo "This_BIG_SenTence" | grep -i this | grep -i big | grep -i sentence) ]; then echo matched; else echo not_matched; fi
Now I know what you mean. Try this:
#!/bin/bash
# add 4 non-matching examples
> snooze.foo_bar
> snooze.bar_go
> snooze.go_foo
> snooze.no_match
# add 3 matching examples
> snooze.foo_bar_go
> snooze.goXX_XXfoo_XXbarXX
> snooze.bar_go_foo_Ok
str1=("foo" "bar" "go")
for i in `ls snooze.*`; do
str2=${i#snooze.}
j=0
found=1
while [[ $j -lt ${#str1[@]} ]]; do
if ! echo $str2 | eval grep \${str1[$j]} >& /dev/null; then
found=0
break
fi
((j++))
done
if [[ $found -ne 0 ]]; then
echo Match found: $str2
fi
done
The resulting output of this script:
Match found: bar_go_foo_Ok
Match found: foo_bar_go
Match found: goXX_XXfoo_XXbarXX
Alternatively, the if..grep line above can be replaced by
if [[ ! $str2 =~ `eval echo \${str1[$j]}` ]]; then
utilizing bash's regular expression match.
Note: I am not too careful about special characters in the search string, such as "\" or " " (space), which may cause problems.
--- Some explanations ---
In the if..grep line, $j is first evaluated to the running index, from 0 to the number of elements in str1 minus 1. Then eval re-evaluates the whole grep command, causing ${str1[jjj]} to be expanded (here, jjj is the already evaluated index).
The strategy is to set found=1 (found by default), and then when any grep fails, we set found to 0 and break the inner j-loop.
Everything else should be straightforward.

How can I skip empty lines in bash with mapfile / readarray?

foo()
{
mapfile -t arr <<< "${1}"
for i in "${!arr[#]}"
do
if [ -z "${arr[$i]}" ]
then
unset arr[$i]
fi
done
}
When I pass a variable with some content in it (a big string, basically), I would like to:
interpret the first word before the first whitespace as a key and everything after the first whitespace becomes the entry for that key in my associative array
skip empty lines or lines with just a newline
Example (empty lines not included, for compactness):
google https://www.google.com
yahoo https://www.yahoo.com
microsoft https://www.microsoft.com
the array should look like
[ google ] == https://www.google.com
[ yahoo ] == https://www.yahoo.com
[ microsoft ] == https://www.microsoft.com
I haven't found any good solution in the bash manual for the two points. The function foo that you see is kind of a hack: it creates an array and only afterwards goes through the entire array and deletes the entries that are null.
So point 2 gets a solution, probably an inefficient one, but it works; point 1 still doesn't have a good solution, and the only alternative I know of is to build the array while iterating with read.
Do you know how to improve this ?
mapfile doesn't build associative arrays, so you'll need a read loop for point 1. If point 2 were the only issue, the simplest solution would be to filter the input with, e.g., grep: mapfile -t arr < <(echo "$1" | grep -v "^$").
Falling back to an explicit loop using read, use the =~ operator to match and skip blank lines.
declare -A arr
while read key value; do
if [[ $value =~ ^[[:space:]]*$ ]]; then # Or your favorite regex for skipping blank lines
continue
fi
arr["$key"]="$value"
done <<< "$1"
You can also skip blank lines using grep even with the while loop:
declare -A arr
while read key value; do
arr["$key"]="$value"
done < <(echo "$1" | grep -v '^\s*$')
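Either way, here is a quick sketch to inspect the result (key order is unspecified for associative arrays):
for key in "${!arr[@]}"; do
    printf '[ %s ] == %s\n' "$key" "${arr[$key]}"
done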
