Printing in Tabular format in TCL/PERL - file

I have a Tcl script in which a variable receives a collection of data on every loop iteration and appends it to a file. Suppose in loop 1,
$var = {xy} {ty} {po} {iu} {ii}
and in loop2
$var = {a} {b} {c} {d1} {d2} {e3}
The variable is then dumped to a file f.txt with puts $file $var, and in the file it comes out like this:
Line number 1: {xy} {ty} {po} {iu} {ii}
Line number 2: {a} {b} {c} {d1} {d2}
I want to print them finally in a file in tabular format. Like below:
xy a
ty b
po c
iu d1
ii d2

First, read the file in and extract the words on the first two lines:
set f [open "f.txt"]
set words1 [regexp -all -inline {\S+} [gets $f]]
set words2 [regexp -all -inline {\S+} [gets $f]]
close $f
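The trick here is that regexp -all -inline returns all matching substrings, and \S+ selects non-whitespace character sequences. Note that \S+ keeps any literal braces that are in the file; if they should be stripped, the pattern {[^\s{}]+} can be used instead.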
Then, because we're producing tabular output, we need to measure the maximum size of the items in the first list. We might as well measure the second list at the same time.
set len1 [tcl::mathfunc::max {*}[lmap w $words1 {string length $w}]]
set len2 [tcl::mathfunc::max {*}[lmap w $words2 {string length $w}]]
The lmap applies a string length to each word, and then we find the maximum of them. {*} substitutes the list (of word lengths) as multiple arguments.
Now, we can iterate over the two lists and produce formatted output:
foreach w1 $words1 w2 $words2 {
puts [format "%-*s %-*s" $len1 $w1 $len2 $w2]
}
The format sequence %-*s consumes two arguments, one is the length of the field, and the other is the string to put in that field. It left-aligns the value within the field, and pads on the right with spaces. Without the - it would right-align; that's more useful for integers. You could instead use tab characters to separate, which usually works well if the words are short, but isn't so good once you get a wider mix of lengths.
If you're looking to produce an actual Tab-Separated Values file, the csv package in Tcllib will generate those fine with the right (obvious!) options.
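For comparison, the same measure-then-pad idea can be sketched in bash, whose printf builtin also understands %-*s. This is only a minimal sketch, not part of the Tcl answer: the function name print_columns is made up for illustration, and it assumes the input has exactly two lines of brace-wrapped words, as in the question.

```shell
#!/bin/bash
# print_columns (hypothetical helper): read two lines of brace-wrapped words
# from stdin and print them side by side, each column padded to its widest word.
print_columns() {
    local line1 line2 w i len1=0 len2=0
    local -a col1 col2
    IFS= read -r line1
    IFS= read -r line2
    # Drop the literal braces, then split each line on whitespace.
    read -r -a col1 <<< "$(tr -d '{}' <<< "$line1")"
    read -r -a col2 <<< "$(tr -d '{}' <<< "$line2")"
    # Measure the widest word in each column.
    for w in "${col1[@]}"; do (( ${#w} > len1 )) && len1=${#w}; done
    for w in "${col2[@]}"; do (( ${#w} > len2 )) && len2=${#w}; done
    # %-*s takes a width argument followed by the string, and left-aligns it.
    for ((i = 0; i < ${#col1[@]} || i < ${#col2[@]}; i++)); do
        printf '%-*s %-*s\n' "$len1" "${col1[i]}" "$len2" "${col2[i]}"
    done
}

# Demo with inline data; with a real file this would be: print_columns < f.txt
print_columns <<< $'{xy} {ty} {po}\n{a} {b} {c}'
```

If the second line has more words than the first, the extra rows are printed with an empty first column.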

Try this:
$ perl -anE 'push @{$vars[$_]}, ($F[$_] =~ s/^[{]|[}]$//gr) for 0 .. $#F; END {say join "\t", @$_ for @vars}' f.txt
xy a
ty b
po c
iu d1
ii d2
Command line switches:
-a : Turn on autosplit on white space into the @F array.
-n : Loop over the lines of the input file, setting the @F array to the words on the current line.
-E : Execute the following argument as a one-liner, like -e but with optional features such as say enabled.
Removing the surrounding braces from each word:
$F[$_] =~ s/^[{]|[}]$//gr
g : global substitution (we want to remove both { and })
r : non-destructive operation, returns the result of the substitution instead of modifying @F

Related

Bash Convert text string into array with multiple \r\n as field separator

I have a windows text file in the format:
line\r\n
line\r\n
line\r\n
\r\n
line\r\n
line\r\n
line\r\n
\r\n
...
I want to put this text file into an array where the field separator is \r\n\r\n. I did search for an answer, but nothing I found and tried worked. awk, for example, is too complex for me, and FS= did not work as I expected.
Commands to read arrays in bash can (as far as I know) only use single characters as a field separator, not complete strings like \r\n\r\n.
Workaround
First replace the field separator \r\n\r\n with a single character that is not used in the string to be split. I found \x1e (the ASCII control character »Record Separator«) to work quite well.
Then read the array using the new (one character) field separator.
The field separator will always be removed when reading something to an array. But you can append the separator to each field.
Here is a pure bash solution to read the file file into the array array:
IFS=$'\x1e'
filecontent="$(< file)"
array=(${filecontent//$'\r\n\r\n'/$'\x1e'})
array=("${array[@]/%/$'\r\n\r\n'}")
IFS=$'\x1e' sets bash's field separator which is used to split strings into arrays. Depending on your script you may want to restore the old IFS afterwards (default is IFS=$' \t\n').
Results
For file
A B C\r\n
D E F\r\n
\r\n
G H I\r\n
\r\n
the resulting array will have two entries:
${array[0]}
A B C\r\n
D E F\r\n
\r\n
${array[1]}
G H I\r\n
\r\n
Known Problems
IFS at the beginning and end of the string will be trimmed. Repeated IFS will be squeezed. The file \r\n\r\n will result in an array without entries. Empty entries cannot be created.
\r\n\r\n is appended to all entries in all cases. The file A\r\n\r\nB will result in an array with the two entries A\r\n\r\n and B\r\n\r\n.
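A minimal sketch of the same workaround wrapped in a function, so that the changed IFS does not leak into the rest of the script. The function name split_records and the array name records are made up for illustration; the sketch also disables pathname expansion while splitting, which the snippet above does not do, and it leaves the separator off the entries.

```shell
#!/bin/bash
# split_records (hypothetical helper): split the string $1 on \r\n\r\n and
# store the pieces in the global array "records".
split_records() {
    local content=$1
    local IFS=$'\x1e'          # local, so the caller's IFS is untouched
    # Replace the multi-character separator with the single character \x1e.
    content=${content//$'\r\n\r\n'/$'\x1e'}
    set -f                     # no pathname expansion on the unquoted expansion
    records=($content)         # word splitting now happens on \x1e only
    set +f                     # note: re-enables globbing unconditionally
}

split_records $'A B C\r\nD E F\r\n\r\nG H I\r\n\r\n'
printf '%q\n' "${records[@]}"  # one shell-quoted entry per line
```

The known problems listed above still apply: leading, trailing and repeated separators do not produce empty entries.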
In Linux all lines of files are terminated with \n.
So your problem is not the \r\n , it is just the \r. So just remove it:
$ tr -d '\r' <file >newfile
To verify that \r is removed you can do:
$ head -n2 newfile |od -t x1c
This will get the first two lines of the new file and the od tool will dump / convert those lines in ascii hex codes. In ascii hex \r is \x0d and \n is \x0a.
Once you have removed the \r from your file you can do anything you want.
You can use all linux tools (including awk) straight forward without special settings.
To build an array you can use:
$ while read -r line;do data+=("$line");done <newfile
If you want to skip blank lines , this one is enough:
$ while read -r line;do [[ "$line" == "" ]] && continue;data+=("$line") ;done <file1
You can of course combine array creation with removal of the \r on the fly, without modifying your existing file, like this:
while read -r line;do [[ "$line" == "" ]] && continue;data+=("$line") ;done < <(tr -d '\r' <file1)
To see what is inside array "data" just use $ declare -p data
PS: By the way, using awk -v RS="\r\n" '{your awk code here}' should be enough to read the original file directly in awk as well. RS is the record (line) separator. A multi-character RS is honored by GNU awk, but POSIX leaves the behavior unspecified.
I made this script in pure bash, even if the answer from socowi is pure bash too:
exec < filern.txt
declare -a array
acc=""
lineno=0
cr=$(echo -en "\r")
while read line; do
line=${line%$cr}
if [ -z "$line" ]; then
let lineno=$lineno+1
array[$lineno]=$acc
acc=""
else
[ ! -z "$acc" ] && acc="$acc--" # you can use any separator here
acc="$acc$line"
fi
done
echo "Read file in array:"
for ((i=1; i<= ${#array[@]}; i++)) do
printf "%3.3d |%s|\n" $i "${array[$i]}"
done
It reads a "real" line of input at a time, and strips the trailing \r.
At this point, a sequence \r\n\r\n turns into an empty line, so that is used to assign the array elements one after the other.
The output from the example file is:
Read file in array:
001 |line--line--line|
002 |line--line--line|
The separator could also be a \r, or whatever. I couldn't find a way to strip the trailing \r directly with line=${line% ?? }, so I used a variable. The same trick can be used to add a "strange" separator to the variable acc. I hope it helps.

Bash, split words into letters and save to array

I'm struggling with a project. I am supposed to write a bash script that works like the tr command. To start, I would like to save each of the command's arguments into a separate array, and if an argument is a word, I would like each character in a separate array field, e.g.
tr_mine AB DC
I would like to have two arrays: a[0] = A, a[1] = B and b[0] = D, b[1] = C.
I found a way, but it's not working:
IFS="" read -r -a array <<< "$a"
No sed, no awk, all bash internals.
Assuming that words are always separated with blanks (space and/or tabs),
also assuming that words are given as arguments, and writing for bash only:
#!/bin/bash
blank=$'[ \t]'
varname='A'
n=1
while IFS='' read -r -d '' -N 1 c ; do
if [[ $c =~ $blank ]]; then n=$((n+1)); continue; fi
eval ${varname}${n}'+=("'"$c"'")'
done <<<"$@"
last=$(eval echo \${#${varname}${n}[@]}) ### Find last character index.
unset "${varname}${n}[$last-1]" ### Remove last (trailing) newline.
for ((j=1;j<=$n;j++)); do
k="A$j[@]"
printf '<%s> ' "${!k}"; echo
done
That will set each array A1, A2, A3, etc. ... to the letters of each word.
The value at the end of the first loop of $n is the count of words processed.
Printing may be a little tricky, that is why the code to access each letter is given above.
Applied to your sample text:
$ script.sh AB DC
<A> <B>
<D> <C>
The script is setting two (array) vars A1 and A2.
And each letter is one array element: A1[0] = A, A1[1] = B and A2[0] = D, A2[1] = C.
You need to set a variable ($k) to the array element to access.
For example, to echo fourth letter (0 based) of second word (1 based) you need to do (that may be changed if needed):
k="A2[3]"; echo "${!k}" ### Indirect addressing.
The script will work as this:
$ script.sh ABCD efghi
<A> <B> <C> <D>
<e> <f> <g> <h> <i>
Caveat: Characters will be split even if quoted. However, quoted arguments is the correct way to use this script to avoid the effect of shell metacharacters ( |,&,;,(,),<,>,space,tab ). Of course, spaces (even if repeated) will split words as defined by the variable $blank:
$ script.sh $'qwer;rttt fgf\ngfg'
<q> <w> <e> <r> <;> <r> <t> <t> <t>
<>
<>
<>
<f> <g> <f> <
> <g> <f> <g>
As the script will accept and correctly process embedded newlines, we need to use unset "${varname}${n}[$last-1]" to remove the last trailing "newline". If that is not desired, quote the line.
Security Note: The eval is not much of a problem here as it is only processing one character at a time. It would be difficult to create an attack based on just one character. Anyway, the usual warning is valid: Always sanitize your input before using this script. Also, most (not quoted) metacharacters of bash will break this script.
$ script.sh qwer(rttt fgfgfg
bash: syntax error near unexpected token `('
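For comparison, a word can also be split into letters without eval by indexing the string with ${word:i:1}. The following is a minimal sketch, not the script above: the helper name split_letters and the scratch array letters are made up for illustration, and each word is handled in a separate call.

```shell
#!/bin/bash
# split_letters (hypothetical helper): store each character of $1 as one
# element of the global array "letters".
split_letters() {
    local word=$1 i
    letters=()
    for ((i = 0; i < ${#word}; i++)); do
        letters+=("${word:i:1}")   # substring of length 1 at offset i
    done
}

split_letters "AB"; a=("${letters[@]}")   # copy the result into array a
split_letters "DC"; b=("${letters[@]}")   # copy the result into array b

printf '<%s> ' "${a[@]}"; echo
printf '<%s> ' "${b[@]}"; echo
```

Because nothing is passed through eval, characters such as ; or ( inside a quoted argument are stored like any other character.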
I would strongly suggest to do this in another language if possible, it will be a lot easier.
Now, the closest I come up with is:
#!/bin/bash
sentence="AC DC"
words=`echo "$sentence" | tr " " "\n"`
# final array
declare -A result
# word count
wc=0
for i in $words; do
# letter count in the word
lc=0
for l in `echo "$i" | grep -o .`; do
result["w$wc-l$lc"]=$l
lc=$(($lc+1))
done
wc=$(($wc+1))
done
rLen=${#result[@]}
echo "Result Length $rLen"
for i in "${!result[@]}"
do
echo "$i => ${result[$i]}"
done
The above prints:
Result Length 4
w1-l1 => C
w1-l0 => D
w0-l0 => A
w0-l1 => C
Explanation:
Dynamic variables are not supported in bash (ie create variables using variables) so I am using an associative array instead (result)
Arrays in bash are single dimension. To fake a 2D array I use the indexes: w for words and l for letters. This will make further processing a pain...
Associative arrays are not ordered thus results appear in random order when printing
${!result[@]} is used instead of ${result[@]}. The first iterates keys while the second iterates values
I know this is not exactly what you ask for, but I hope it will point you to the right direction
Try this :
sentence="$@"
read -r -a words <<< "$sentence"
for word in ${words[@]}; do
inc=$(( i++ ))
read -r -a l${inc} <<< $(sed 's/./& /g' <<< $word)
done
echo ${words[1]} # print "CD"
echo ${l1[1]} # print "D"
The first read reads all words, the internal one is for letters.
The sed command add a space after each letters to make the string splittable by read -a. You can also use this sed command to remove unwanted characters from words (eg commas) before splitting.
If special characters are allowed in words, you can use a simple grep instead of the sed command (as suggested in http://www.unixcl.com/2009/07/split-string-to-characters-in-bash.html) :
read -r -a l${inc} <<< $(grep -o . <<< $word)
The word array is ${words[@]}.
The letters arrays are named l# where # is an increment added for each word read.

Tcl-Tk Find a specific word in a file

I am trying to find a matching row in a text file having 4 columns of numbers like this:
number coordinates
101138 0.420335 -.238945 .1446484
101139 .4134844 -0.2437 6.7484e-2
101140 .4140046 -.243681 7.3344e-2
I need to read the text file and find a specific number in the first column and plot only its coordinates.
This is my code in which I try to find the coordinates for number "101138" but something is not working because there is no match found.
set Output [open "Output1.txt" w]
set FileInput [open "Input.txt" r]
set filecontent [read $FileInput]
set inputList [split $filecontent "\n"]
set Text [lsearch -all -inline $inputList "101138"]
foreach elem $Text {
puts $Output "[lindex $elem 1] [lindex $elem 2] [lindex $elem 3]"
}
You are searching for a list element that exactly matches your given value "101138". However your list is constructed from lines which have multiple whitespace delimited columns. You need to amend your search to match this value in the correct column.
One method would be to split each line again and perform an equals match on the correct column. Another might be to use a glob or regexp expression that actually matches the inputs. ie:
% set lst {"123 abc def" "456 efg ijk" "789 zxc cvb"}
"123 abc def" "456 efg ijk" "789 zxc cvb"
% lsearch -all -inline $lst "456*"
{456 efg ijk}
% lsearch -all -inline -regexp $lst "^456"
{456 efg ijk}
The first lsearch does a standard (glob) match, looking for a list element that begins with 456 followed by anything.
The last line searches for a list element that begins with "456" using regular expression matching.

Shell script array from command line

I'm trying to write a shell script that can accept multiple elements on the command line to be treated as a single array. The command line argument format is:
exec trial.sh 1 2 {element1 element2} 4
I know that the first two arguments can be accessed with $1 and $2, but how can I access the array surrounded by the brackets, that is, the arguments surrounded by the {} symbols?
Thanks!
This tcl script uses regex parsing to extract pieces of the commandline, transforming your third argument into a list.
Splitting is done on whitespaces - depending on where you want to use this may or may not be sufficient.
#!/usr/bin/env tclsh
#
# Sample arguments: 1 2 {element1 element2} 4
# Split the commandline arguments:
# - tcl will represent the curly brackets as \{ which makes the regex a bit ugly as we have to escape this
# - we use '->' to catch the full regex match as we are not interested in the value and it looks good
# - we are splitting on white spaces here
# - the content between the curly braces is extracted
regexp {(.*?)\s(.*?)\s\\\{(.*?)\\\}\s(.*?)$} $::argv -> first second third fourth
puts "Argument extraction:"
puts "argv: $::argv"
puts "arg1: $first"
puts "arg2: $second"
puts "arg3: $third"
puts "arg4: $fourth"
# Third argument is to be treated as an array, again split on white space
set theArguments [regexp -all -inline {\S+} $third]
puts "\nArguments for parameter 3"
foreach arg $theArguments {
puts "arg: $arg"
}
You should always place variable-length arguments at the end. But if you can guarantee that the last argument is always provided, then something like this will suffice:
#!/bin/bash
arg1=$1 ; shift
arg2=$1 ; shift
# Get the array passed in.
arrArgs=()
while (( $# > 1 )) ; do
arrArgs=( "${arrArgs[@]}" "$1" )
shift
done
lastArg=$1 ; shift
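To see the effect, here is a minimal sketch of the same loop wrapped in a function and called with sample values. The function name parse_args is made up for illustration, the array elements are passed as plain arguments without the {} of the question, and += is used to append, which is equivalent to rebuilding the array.

```shell
#!/bin/bash
# parse_args (hypothetical wrapper): the first two arguments and the last one
# are scalars; everything in between is collected into the array arrArgs.
parse_args() {
    arg1=$1; shift
    arg2=$1; shift
    arrArgs=()
    while (( $# > 1 )); do     # stop when only the last argument is left
        arrArgs+=("$1")
        shift
    done
    lastArg=$1
}

parse_args 1 2 element1 element2 4
echo "arg1=$arg1 arg2=$arg2 lastArg=$lastArg"
printf 'array element: %s\n' "${arrArgs[@]}"
```

Quoting "$1" when appending keeps elements that contain spaces intact.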

Which data structure might be a more efficient implementation?

I was doing an exercise on reading from a setup file in which every line specifies two words and a number. The number denotes the number of words in between the two words specified. Another file – input.txt – has a block of text, and the program attempts to count the number of occurrences in the input file which follow the constraints in each line of the setup file (i.e., two particular words a and b should be separated by n words, where a, b and n are specified in the setup file).
So I've tried to do this as a shell script, but my implementation is probably highly inefficient. I used an array to store the words from the setup file, and then did a linear search on the text file to find out the words, and the works. Here's a bit of the code, if it helps:
#!/bin/bash
j=0
count=0;
m=0;
flag=0;
error=0;
while read line; do
line=($line);
a[j]=${line[0]}
b[j]=${line[1]}
num=${line[2]}
c[j]=`expr $num + 0`
j=`expr $j + 1`
done <input2.txt
while read line2; do
line2=($line2)
for (( i=0; $i<=50; i++ )); do
for (( m=0; $m<j; m++)); do
g=`expr $i + ${c[m]}`
g=`expr $g + 1`
if [ "${line2[i]}" == "${a[m]}" ] ; then
for (( k=$i; $k<$g; k++)); do
if [[ "${line2[k]}" == *.* ]]; then
flag=1
break
fi
done
if [ "${b[m]}" == "${line2[g]}" ] ; then
if [ "$flag" == 1 ] ; then
error=`expr $error + 1`
fi
count=`expr $count + 1`
fi
flag=0
fi
if [ "${line2[i]}" == "${b[m]}" ] ; then
for (( k=$i; $k<$g; k++)); do
if [[ "${line2[k]}" == *.* ]]; then
flag=1
break
fi
done
if [ "${a[m]}" == "${line2[g]}" ] ; then
if [ "$flag" == 1 ] ; then
error=`expr $error + 1`
fi
count=`expr $count + 1`
fi
flag=0
fi
done
done
done <input.txt
count=`expr $count - $error`
echo "| Count = $count |"
As you can see, this takes a lot of time.
I was thinking of a more efficient way to implement this, in C or C++, this time. What could be a possible alternative implementation of this, efficiency considered? I thought of hash tables, but could there be a better way?
I'd like to hear what everyone has to say on this.
Here's a fully working possibility. It is not 100% pure bash since it uses (GNU) sed: I'm using sed to lowercase everything and to get rid of punctuation marks. Maybe you won't need this. Adapt to your needs.
#!/bin/bash
input=input.txt
setup=setup.txt
# The Check function
Check() {
# $1 is word1
# $2 is word2
# $3 is number of words between word1 and word2
nb=0
# Get all positions of w1
IFS=, read -a q <<< "${positions[$1]}"
# Check, for each position, if word2 is at distance $3 from word1
for i in "${q[@]}"; do
[[ ${words[$i+$3+1]} = $2 ]] && ((++nb))
done
echo "$nb"
}
# Slurp input file in an array
words=( $(sed 's/[,.:!?]//g;s/\(.*\)/\L\1/' -- "$input") )
# For each word, specify its positions in file
declare -A positions
pos=0
for i in "${words[@]}"; do
positions[$i]+=$((pos++)),
done
# Do it!
while read w1 w2 p; do
# Check that w1 w2 are not empty
[[ -n $w2 ]] || continue
# Check that p is a number
[[ $p =~ ^[[:digit:]]+$ ]] || continue
n=$(Check "$w1" "$w2" "$p")
[[ $w1 != $w2 ]] && (( n += $(Check "$w2" "$w1" "$p") ))
echo "$w1 $w2 $p: $n"
done < <(sed 's/\(.*\)/\L\1/' -- "$setup")
How does it work:
we first read the whole file input.txt in the array words: a word per field. Observe I'm using sed here to delete all punctuation marks (well, only ,, ., :, !, ?, for testing purposes, add some more if you wish) and to lowercase every letter.
Loop through the array words and for each word, put its position in an associative array positions:
w => "position1,position2,...,positionk,"
Finally, we read the setup.txt file (filtered through sed again to lowercase everything – optional see below). Do a quick check whether the line is valid (2 words and a number) and then call the Check function (twice, for each permutation of the given words, unless both words are equal).
The Check function finds all positions of word1 in file, thanks to associative array positions and then using the array words, check whether word2 is at the given "distance" from word1.
The second sed is optional. I've filtered the setup.txt file through sed to lowercase everything. This sed will leave only very little overhead, so, efficiency-wise, it's not a big deal. You'll be able to add more filtering later to make sure the data is consistent with how the script will use it (e.g., get rid of punctuation marks). Otherwise you could:
Get rid of it altogether: replace the corresponding line (the last line) with just
done < "$setup"
In this case, you'll have to trust the guy/gal who will write the setup.txt file.
Get rid of it as above, but still want to convert everything to lowercase. In this case, below the
while read w1 w2 p; do
line, just add these lines:
w1=${w1,,}
w2=${w2,,}
That's the bash way to lowercase a string.
Caveats. The script will break if:
The number given in setup.txt file starts with a 0 and contains an 8 or a 9. This is because bash will consider it's an octal number, where 8's and 9's are not valid. There are workarounds for this.
The text in input.txt doesn't follow proper typographical practices: a punctuation mark is always followed by a space. E.g., if the input file contains
The quick,brown,dog jumps over the lazy fox
then after the sed treatment the text will look like
The quickbrowndog jumps over the lazy fox
and the words quick, brown and dog won't be treated properly. You can replace the sed substitution s/[,.:!?]//g with s/[,.:!?]/ /g to replace these symbols with a space. It's up to you, but in that case abbreviations such as e.g. and i.e. might not be handled properly… it now really depends on what you need to do.
Different character encodings are used… I don't really know how robust you need the script to be, and what languages and encodings you'll consider.
(Add stuff here :).)
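One possible workaround for the leading-zero caveat is to force base 10 with bash's base#number arithmetic syntax. This is a minimal sketch: the helper name to_decimal is made up for illustration, and it assumes the value has already passed the digits-only check in the script.

```shell
#!/bin/bash
# to_decimal (hypothetical helper): interpret $1 as a base-10 number even if
# it has leading zeros (bash would otherwise read 08 or 09 as invalid octal).
to_decimal() {
    echo $(( 10#$1 ))
}

p="08"
p=$(to_decimal "$p")
echo "$p"    # the leading zero is gone, so $p is safe in later arithmetic
```

In the script above, the conversion could be applied to p right after the digit check and before calling Check.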
About efficiency. I'd say the algorithm is rather efficient. bash is probably not the best suited language for that, but it's a lot of fun, and not that difficult after all if we look at it (less than 20 lines of relevant code, and even less than that!). If you only have 50 files with 50000 words, it's ok, you will not notice too much difference between bash and perl/python/awk/C/you-name-it: bash performs decently quickly for files of this type. Now if you have 100000 files each containing millions of words, well, a different approach should be taken and a different language should be used (but I don't know which one).
If:
it can get complex for the sake of efficiency
the text file can be large
the setup file can have many rows
then I would do it the following way:
As preparation I would create:
A hash map with the index of the word as key and the word as the value (named -say- WORDS). So WORDS[1] would be the first word, WORDS[2] the second, and so on.
A hash map with the words as keys and the list of indexes as values (named, say, INDEXES). So if WORDS[2] and WORDS[5] are "dog" and no others are, then INDEXES["dog"] would yield the numbers 2 and 5. The value can be a dynamic indexed array or a linked list. A linked list is better if there are words that occur many times.
You can read the text file, and populate both structures at the same time.
Processing:
For each row of the setup file I would get the indexes in INDEXES[firstword] and check if WORDS[index + wordsinbetween + 1] equals with secondword. If it does, that's a hit.
Notes:
Preparation: You only read the text file once. For each word in the text file, you're doing fast operations whose performance is not really affected by the number of words already processed.
Processing: You only read the setup file once. For each row, you are again doing operations that are only affected by the number of occurrences of firstword in the text file.

Resources