Tcl/Tk: Find a specific word in a file

I am trying to find a matching row in a text file having 4 columns of numbers like this:
number coordinates
101138 0.420335 -.238945 .1446484
101139 .4134844 -0.2437 6.7484e-2
101140 .4140046 -.243681 7.3344e-2
I need to read the text file and find a specific number in the first column and plot only its coordinates.
This is my code, in which I try to find the coordinates for number "101138", but something is not working: no match is found.
set Output [open "Output1.txt" w]
set FileInput [open "Input.txt" r]
set filecontent [read $FileInput]
set inputList [split $filecontent "\n"]
set Text [lsearch -all -inline $inputList "101138"]
foreach elem $Text {
puts $Output "[lindex $elem 1] [lindex $elem 2] [lindex $elem 3]"
}

You are searching for a list element that exactly matches your given value "101138". However, your list is constructed from lines which have multiple whitespace-delimited columns. You need to amend your search to match this value in the correct column.
One method would be to split each line again and perform an equality match on the correct column. Another would be to use a glob or regexp pattern that actually matches the inputs, e.g.:
% set lst {"123 abc def" "456 efg ijk" "789 zxc cvb"}
"123 abc def" "456 efg ijk" "789 zxc cvb"
% lsearch -all -inline $lst "456*"
{456 efg ijk}
% lsearch -all -inline -regexp $lst "^456"
{456 efg ijk}
The second command does a standard (glob) match, looking for a list element beginning with 456 followed by anything.
The last command searches for a list element that begins with "456" using regular-expression matching.
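For comparison, the "split each line and match the correct column" approach from the answer can be sketched in Python (the data rows are taken from the question; the function name is made up for illustration):

```python
def matching_coords(lines, target):
    """Return the coordinate columns of rows whose first column equals target."""
    result = []
    for line in lines:
        cols = line.split()
        # Exact match on the first column only, never on the whole line
        if len(cols) >= 4 and cols[0] == target:
            result.append(" ".join(cols[1:4]))
    return result

rows = [
    "101138 0.420335 -.238945 .1446484",
    "101139 .4134844 -0.2437 6.7484e-2",
]
print(matching_coords(rows, "101138"))  # ['0.420335 -.238945 .1446484']
```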

Related

Bash: how to extract longest directory paths from an array?

I put the output of find command into array like this:
pathList=($(find /foo/bar/ -type d))
How to extract the longest paths found in the array if the array contains several equal-length longest paths?
echo ${pathList[@]}
/foo/bar/raw/
/foo/bar/raw/2020/
/foo/bar/raw/2020/02/
/foo/bar/logs/
/foo/bar/logs/2020/
/foo/bar/logs/2020/02/
After extraction, I would like to assign /foo/bar/raw/2020/02/ and /foo/bar/logs/2020/02/ to another array.
Thank you
Could you please try the following. This should print the longest path(s) (there could be multiple entries with the same maximum length), which you could later assign to another array.
echo "${pathList[@]}" |
awk -F'/' '{max=max>NF?max:NF;a[NF]=(a[NF]?a[NF] ORS:"")$0} END{print a[max]}'
I just created a test array with values provided by you and tested it as follows:
arr1=($(printf '%s\n' "${pathList[@]}" |\
awk -F'/' '{max=max>NF?max:NF;a[NF]=(a[NF]?a[NF] ORS:"")$0} END{print a[max]}'))
When I see new array's contents they are as follows:
echo "${arr1[@]}"
/foo/bar/raw/2020/02/
/foo/bar/logs/2020/02/
Explanation of the awk code:
awk -F'/' ' ##Starting awk program from here and setting field separator as / for all lines.
{
max=max>NF?max:NF ##Set max to the larger of its current value and NF (the field count of the current line).
a[NF]=(a[NF]?a[NF] ORS:"")$0 ##Append the current line to array a at index NF, separating entries with newlines.
}
END{ ##Starting END section of this program.
print a[max] ##Printing value of array a with index of variable max.
}'
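The same idea, sketched in Python for readers who prefer it: count the path components and keep only the deepest entries (the function name is illustrative):

```python
def longest_paths(paths):
    """Keep only the paths with the maximum number of '/' separators,
    mirroring the awk solution's use of NF with -F'/'."""
    depth = lambda p: p.count("/")
    deepest = max(map(depth, paths))
    return [p for p in paths if depth(p) == deepest]

paths = [
    "/foo/bar/raw/",
    "/foo/bar/raw/2020/",
    "/foo/bar/raw/2020/02/",
    "/foo/bar/logs/",
    "/foo/bar/logs/2020/",
    "/foo/bar/logs/2020/02/",
]
print(longest_paths(paths))
# ['/foo/bar/raw/2020/02/', '/foo/bar/logs/2020/02/']
```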

Printing in tabular format in Tcl/Perl

I have a script in Tcl in which a variable gets a collection of data in every loop iteration and appends it to a file. Suppose in loop 1,
$var = {xy} {ty} {po} {iu} {ii}
and in loop2
$var = {a} {b} {c} {d1} {d2} {e3}
Now the variable is dumped to a file f.txt, like puts $file $var, and in the file it comes out like this:
Line number 1: {xy} {ty} {po} {iu} {ii}
Line number 2: {a} {b} {c} {d1} {d2}
I want to print them finally in a file in tabular format. Like below:
xy a
ty b
po c
iu d1
ii d2
First, read the file in and extract the words on the first two lines:
set f [open "f.txt"]
set words1 [regexp -all -inline {\S+} [gets $f]]
set words2 [regexp -all -inline {\S+} [gets $f]]
close $f
The trick here is that regexp -all -inline returns all matching substrings, and \S+ selects non-whitespace character sequences.
Then, because we're producing tabular output, we need to measure the maximum size of the items in the first list. We might as well measure the second list at the same time.
set len1 [tcl::mathfunc::max {*}[lmap w $words1 {string length $w}]]
set len2 [tcl::mathfunc::max {*}[lmap w $words2 {string length $w}]]
The lmap applies a string length to each word, and then we find the maximum of them. {*} substitutes the list (of word lengths) as multiple arguments.
Now, we can iterate over the two lists and produce formatted output:
foreach w1 $words1 w2 $words2 {
puts [format "%-*s %-*s" $len1 $w1 $len2 $w2]
}
The format sequence %-*s consumes two arguments, one is the length of the field, and the other is the string to put in that field. It left-aligns the value within the field, and pads on the right with spaces. Without the - it would right-align; that's more useful for integers. You could instead use tab characters to separate, which usually works well if the words are short, but isn't so good once you get a wider mix of lengths.
If you're looking to produce an actual Tab-Separated Values file, the csv package in Tcllib will generate those fine with the right (obvious!) options.
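The measure-then-format idea translates directly to other languages; here is a sketch in Python (the function name is made up) that computes the field widths and left-aligns, just like the %-*s format specifier above:

```python
def two_columns(words1, words2):
    """Left-align each list in a field as wide as its longest word."""
    w1 = max(len(w) for w in words1)
    w2 = max(len(w) for w in words2)
    return [f"{a:<{w1}} {b:<{w2}}" for a, b in zip(words1, words2)]

for row in two_columns(["xy", "ty", "po", "iu", "ii"], ["a", "b", "c", "d1", "d2"]):
    print(row)
```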
Try this:
$ perl -anE 'push @{$vars[$_]}, ($F[$_] =~ s/^[{]|[}]$//gr) for 0 .. $#F; END {say join "\t", @$_ for @vars}' f.txt
xy a
ty b
po c
iu d1
ii d2
command line switches:
-a : Turn on autosplit on whitespace into the @F array.
-n : Loop over the lines of the input file, setting @F to the words on the current line.
-E : Execute the following argument as a one-liner
Removing the surrounding braces from each word:
$F[$_] =~ s/^[{]|[}]$//gr
g : global substitution (we want to remove both { and })
r : non-destructive operation; returns the result of the substitution instead of modifying @F

Processing large files using Tcl

I have some information in two large files.
One of them(file1.txt, has ~ 4 million lines) contains all object names(which are unique) and types.
And the other(file2.txt, has ~ 2 million lines) some object names(they can be duplicated) and some values assigned to them.
So, I have something like below in file1.txt:
objName1 objType1
objName2 objType2
objName3 objType3
...
And in file2.txt I have:
objName3 val3_1
objName3 val3_2
objName4 val4
...
For the all objects in file2.txt I need to output object names, their types and values assigned to them in a single file like below:
objType3 val3_1 "objName3"
objType3 val3_2 "objName3"
objType4 val4 "objName4"
...
Previously, object names in file2.txt were supposed to be unique, so I implemented a solution where I read all the data from both files, save it to Tcl arrays, and then iterate over the larger array, checking whether an object with the same name exists in the smaller array, and if so, writing the needed information to a separate file. But this runs too long (> 10 hours and it hasn't completed yet).
How can I improve my solution, or is there another way to do this?
EDIT:
Actually I don't have file1.txt; I find that data by some procedure and write it into a Tcl array. I run some procedure to get the object types and save them to a Tcl array; then I read file2.txt and save its data to another Tcl array; then I iterate over the items in the first array, and if an object name matches some object in the second (object values) array, I write the info to the output file and erase that element from the second array. Here is a piece of the code that I'm running:
set outFileName "output.txt"
if [catch {open $outFileName "w"} fid ] {
puts "ERROR: Failed to open file '$outFileName', no write permission"
exit 1
}
# get object types
set TIME_start [clock clicks -milliseconds]
array set objTypeMap [list]
# here is some proc that fills up objTypeMap
set TIME_taken [expr [clock clicks -milliseconds] - $TIME_start]
puts "Info: Object types are found. Elapsed time $TIME_taken"
# read file2.txt
set TIME_start [clock clicks -milliseconds]
set file2 [lindex $argv 5]
if [catch { set fp [open $file2 r] } errMsg] {
puts "ERROR: Failed to open file '$file2' for reading"
exit 1
}
set objValData [read $fp]
close $fp
# tcl list containing lines of file2.txt
set objValData [split $objValData "\n"]
# remove last empty line
set objValData [lreplace $objValData end end]
array set objValMap [list]
foreach item $objValData {
set objName [string range $item 0 [expr {[string first " " $item] - 1}] ]
set objValue [string range $item [expr {[string first " " $item] + 1}] end ]
set objValMap($objName) $objValue
}
# clear objValData
unset objValData
set TIME_taken [expr [clock clicks -milliseconds] - $TIME_start]
puts "Info: Object value data is read and processed. Elapsed time $TIME_taken"
# write to file
set TIME_start [clock clicks -milliseconds]
foreach { objName objType } [array get objTypeMap] {
if { [array size objValMap] eq 0 } {
break
}
if { [info exists objValMap($objName)] } {
set objValue $objValMap($objName)
puts $fid "$objType $objValue \"$objName\""
unset objValMap($objName)
}
}
if { [array size objValMap] != 0 } {
foreach { objName objVal } [array get objValMap] {
puts "WARNING: Can not find object $objName type, skipped..."
}
}
close $fid
set TIME_taken [expr [clock clicks -milliseconds] - $TIME_start]
puts "Info: Output is created. Elapsed time $TIME_taken"
It seems that for the last step (writing to the file) there are ~8 * 10^12 iterations to do, which is not realistic to complete in a reasonable time: I tried doing 8 * 10^12 iterations in a for loop that just prints the iteration index, and ~850*10^6 iterations took ~30 minutes (so the whole loop will finish in ~11 hours).
So, there should be another solution.
EDIT:
It seems the reason was unsuccessful hashing of the file2.txt map: when I shuffled the lines in file2.txt, I got results in about 3 minutes.
Write the data to file1, and let an external tool do all the hard work (it's bound to be much more optimized for the task than home-spun Tcl code):
exec bash -c {join -o 0,1.2,2.2 <(sort file1.txt) <(sort file2.txt)} > result.txt
So… file1.txt is describing a mapping and file2.txt is the list of things to process and annotate? The right thing is to load the mapping into an array or dictionary where the key is the part that you will look things up by, and to then go through the other file line-by-line. That keeps the amount of data in memory down, but it's worth holding the whole mapping that way anyway.
# We're doing many iterations, so worth doing proper bytecode compilation
apply {{filename1 filename2 filenameOut} {
# Load the mapping; uses memory proportional to the file size
set f [open $filename1]
while {[gets $f line] >= 0} {
regexp {^(\S+)\s+(.*)} $line -> name type
set types($name) $type
}
close $f
# Now do the streaming transform; uses a small fixed amount of memory
set fin [open $filename2]
set fout [open $filenameOut "w"]
while {[gets $fin line] >= 0} {
# Assume that the mapping is probably total; if a line fails to match, we print it
# as it was before. You might have a different preferred strategy here.
catch {
regexp {^(\S+)\s+(.*)} $line -> name info
set line [format "%s %s \"%s\"" $types($name) $info $name]
}
puts $fout $line
}
close $fin
close $fout
# All memory will be collected at this point
}} "file1.txt" "file2.txt" "fileProcessed.txt"
Now, if the mapping is very large, so much that it doesn't fit in memory, then you might be better doing it via building file indices and stuff like that, but frankly then you're actually better off getting familiar with SQLite or some other database.
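As a rough illustration of that database route (not the answerer's code; the table and column names are invented), the join can be handed to SQLite in a few lines:

```python
import sqlite3

# Hypothetical demo: load both datasets into tables and let the engine join them.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE types (name TEXT PRIMARY KEY, type TEXT)")
con.execute("CREATE TABLE vals (name TEXT, value TEXT)")
con.executemany("INSERT INTO types VALUES (?, ?)",
                [("objName3", "objType3"), ("objName4", "objType4")])
con.executemany("INSERT INTO vals VALUES (?, ?)",
                [("objName3", "val3_1"), ("objName3", "val3_2"),
                 ("objName4", "val4")])
rows = con.execute("SELECT t.type, v.value, v.name "
                   "FROM vals v JOIN types t ON t.name = v.name").fetchall()
for typ, val, name in rows:
    print(f'{typ} {val} "{name}"')
```

On real data you would bulk-load the two files with executemany instead of the literal lists above; the primary key on the name column gives the engine an index to join against.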
A pure-Tcl variant of Glenn Jackman's code would be
package require fileutil
package require struct::list
set data1 [lsort -index 0 [split [string trim [fileutil::cat file1.txt]] \n]]
set data2 [lsort -index 0 [split [string trim [fileutil::cat file2.txt]] \n]]
fileutil::writeFile result.txt [struct::list dbJoin -full 0 $data1 0 $data2]
But in this case each row will have four columns, not three: the two columns from file1.txt and the two columns from file2.txt. If that is a problem, reducing the number of columns to three is trivial.
The file join in the example is also full, i.e. all rows from both files will occur in the result, padded by empty strings if the other file has no corresponding data. To solve the OP's problem, an inner join is probably better (only rows that correspond are retained).
fileutil::cat reads the contents of a file, string trim removes leading and trailing whitespace from the contents, to avoid empty lines in the beginning or end, split ... \n creates a list where every row becomes an item, lsort -index 0 sorts that list based on the first word in every item.
The code is verified to work with Tcl 8.6 and fileutil 1.14.8. The fileutil package is a part of the Tcllib companion library for Tcl: the package can be individually upgraded to the current version by downloading the Tcl source and copying it to the relevant location in the Tcl installation's lib tree (C:\Tcl\lib\teapot\package\tcl\teapot\tcl8\8.2 in my case).
Quick-and-dirty install: download fileutil.tcl from here (use the Download button) and copy the file to where your other sources are. In your source code, call source fileutil.tcl and then package require fileutil. (There may still be compatibility problems with Tcl or with e.g. the cmdline package. Reading the source may suggest workarounds for such.) Remember to check the license conditions for conflicts.
Documentation: fileutil package, lsort, package, set, split, string, struct::list package

Sorting by columns in perl

I have a file named all_energy.out and I am trying to sort it so that I can renumber files in the directory based on the lowest energy in the all_energy.out file. I want to create an array with the file names and energies, starting at the lowest energy and going up, like the name-and-age example below.
Analogous Example:
Don 24
Jan 30
Sue 19
sorted to
Sue 19
Don 24
Jan 30
Example of the all_energy.out file (the most negative value is the lowest energy):
Energy
0001_IP3_fullBinding_Rigid0001 -219.209742
0001_IP3_fullBinding_Rigid0002 -219.188106
0001_IP3_fullBinding_Rigid0003 -219.064542
0001_IP3_fullBinding_Rigid0004 -219.050730
0001_IP3_fullBinding_Rigid0005 -219.044573
0001_IP3_fullBinding_Rigid0006 -218.927479
0001_IP3_fullBinding_Rigid0007 -218.919717
0001_IP3_fullBinding_Rigid0008 -218.900923
0001_IP3_fullBinding_Rigid0009 -218.898945
0001_IP3_fullBinding_Rigid0010 -218.889269
0001_IP3_fullBinding_Rigid0011 -218.871619
0001_IP3_fullBinding_Rigid0012 -218.859429
0001_IP3_fullBinding_Rigid0013 -218.848516
0001_IP3_fullBinding_Rigid0014 -218.835355
0001_IP3_fullBinding_Rigid0015 -218.822244
0001_IP3_fullBinding_Rigid0016 -218.819328
0001_IP3_fullBinding_Rigid0017 -218.818431
0001_IP3_fullBinding_Rigid0018 -218.815494
0001_IP3_fullBinding_Rigid0019 -218.798388
0001_IP3_fullBinding_Rigid0020 -218.792151
Energy
0002_IP3_fullBinding_Rigid0001 -226.007998
0002_IP3_fullBinding_Rigid0002 -225.635657
The file names are given before the energy value, for example 0001_IP3_fullBinding_Rigid0001.mol2 is the name of the first file.
Example solution:
0002_IP3_fullBinding_Rigid0001 -226.007998
0002_IP3_fullBinding_Rigid0002 -225.635657
0001_IP3_fullBinding_Rigid0001 -219.209742
0001_IP3_fullBinding_Rigid0002 -219.188106
0001_IP3_fullBinding_Rigid0003 -219.064542
My current script is:
#!/usr/bin/perl
use strict;
use warnings;
print "Name of all total energy containing file:\n";
my $energy_file = <STDIN>;
chomp $energy_file;
my $inputfile_energy = $energy_file;
open (INPUTFILE_ENERGY, "<", $inputfile_energy) or die $!;
print map $inputfile_energy->[0],
sort { $a->[1] <=> $b->[1] }
map { [ $_, /(\d+)$/ ] }
<INPUTFILE_ENERGY>;
close $inputfile_energy;
At this point I am just trying to get the energies with their names to print in the correct order. Then I will loop through the files in the directory, and when a name matches one of the sorted energy names it will be renumbered.
Problems with your script:
/(\d+)$/ only matches digits (0-9) at the end of a line. Your file contains floating point numbers, so only digits after the decimal point will be matched. You could get away with /(\S+)$/ instead. (Actually, in your sample input there is a line with a trailing space, so let's make that /(\S+)\s*$/ instead)
$inputfile_energy is a filename, a scalar, and not a reference, so $inputfile_energy->[0] doesn't make sense. You use it as the expression in a map construction, and in a map EXPR, LIST construction, $_ refers to the current element of the list that is being iterated through, so you probably just meant to say $_->[0].
Your input contains a few lines -- all with the keyword Energy -- that don't have the same format as the other lines you want to sort and should be filtered out.
Putting this all together, I get working code when the penultimate statement looks like:
print map $_->[0],
sort { $a->[1] <=> $b->[1] }
map { [ $_, /(\S+)\s*$/ ] }
grep /\d/,
<INPUTFILE_ENERGY>;
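The same filter-then-numeric-sort pipeline, sketched in Python for comparison (header lines are dropped the same way the grep /\d/ does, by requiring a digit somewhere on the line):

```python
def sort_by_energy(lines):
    """Drop lines without digits (the Energy headers), then sort the remaining
    rows by their numeric last column, most negative first."""
    rows = [ln.split() for ln in lines if any(ch.isdigit() for ch in ln)]
    return sorted(rows, key=lambda row: float(row[-1]))

lines = [
    "Energy",
    "0001_IP3_fullBinding_Rigid0001 -219.209742",
    "0002_IP3_fullBinding_Rigid0001 -226.007998",
]
for name, energy in sort_by_energy(lines):
    print(name, energy)
# 0002_IP3_fullBinding_Rigid0001 -226.007998
# 0001_IP3_fullBinding_Rigid0001 -219.209742
```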
You can use a one-liner like this and run it from the command line:
perl -lnae 'push @arr, [$_, $F[1]] if $F[1]; END { print join "\n", map {$_->[0]} sort {$a->[1] <=> $b->[1]} @arr }' energy_file.txt
1) The -n switch loops over all lines in the input file (energy_file.txt); the current line is available in the $_ variable.
2) The -a switch splits each line on whitespace and puts the nonempty values into the @F array.
A less "idiomatic" solution could be:
my @data = <DATA>;
my @table;
foreach (@data) {
chomp;
next unless /^0/; # skip Energy lines (or any other cleaning test)
my @line = split /\s+/;
push @table, [@line]; # build a 2d array
}
my @sortedTable = sort { $a->[1] <=> $b->[1] } @table;
foreach (@sortedTable) {
printf(
"%5s,%25s\n",
$_->[0],
$_->[1]
) # some pretty printing
}
__DATA__
Energy
0001_IP3_fullBinding_Rigid0001 -219.209742
0001_IP3_fullBinding_Rigid0002 -219.188106
0001_IP3_fullBinding_Rigid0003 -219.064542
0001_IP3_fullBinding_Rigid0004 -219.050730
....
Try this:
print join "\n", sort {(split /\s+/,$a)[1] <=> (split /\s+/,$b)[1]} map{chomp $_; $_} <INPUTFILE_ENERGY>;

bash: split text file at braces into array

I searched the site very thoroughly but was not able to turn up a fitting answer - most probably I wasn't asking the correct questions.
I have a text-file with up to a several thousand lines of coordinates formatted as in the following example:
[1]
-75.4532 75.8273
-115.00 64.5
-90.00 74.3333
-100.00 72.4167
-110.00 69.00
-120.8 56.284
[2]
-70.00 73.75
-100.00 69.3333
-110.00 65.1533
-90.00 71.5833
-80.00 73.00
[3]
-100.00 67.5
-67.7133 72.6611
-80.00 71.5
-90.00 70.00
-110.00 63.8667
-115.8 60.836
What I'm trying to achieve is to split the file into an array at the numbers in brackets, so that I can use the number in brackets as the array's index and the following lines as the corresponding value.
The next step would be looping through the array feeding each element to another program. If there is a more elegant approach I'm willing to listen.
All the best!
You can use sed to massage the file into a bash array definition:
declare -a "$(sed 's/\[/" &/g; s/\]/&="/g' file | sed '1s/^"/arr=(/; $s/$/")/')"
echo "${arr[2]}"
echo
echo ${arr[2]}
-70.00 73.75
-100.00 69.3333
-110.00 65.1533
-90.00 71.5833
-80.00 73.00
-70.00 73.75 -100.00 69.3333 -110.00 65.1533 -90.00 71.5833 -80.00 73.00
Printing with and without quotes to demonstrate the difference
Use a combination of read -d (to set the record delimiter) and IFS (to set the field separator):
# read content from file
content="$(<input_filename)"
# append record separator to avoid dropping the last record
content="$content["
# read into array
arr=()
while IFS=']' read -d '[' sub value; do
arr[$sub]=$value
done <<<"$content"
The resulting array will have an empty first element since it's zero-based. This can make it trickier to loop over it. You can remove the first element explicitly to make the loop easier:
unset arr[0]
Now you can loop over the elements:
for value in "${arr[@]}"; do
program <<< "$value"
done
or if you need the 1-based index as well:
for ((i=1; i<=${#arr[@]}; i++)); do
program "$i" <<< "${arr[i]}"
done
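If the shell quoting becomes a burden, the same bracket-keyed split is easy in Python (a sketch, with an invented function name):

```python
import re

def split_by_brackets(text):
    """Map each bracketed number to the list of coordinate lines that follows it."""
    blocks = {}
    key = None
    for line in text.splitlines():
        m = re.fullmatch(r"\[(\d+)\]", line.strip())
        if m:
            # A [N] header starts a new block keyed by N
            key = int(m.group(1))
            blocks[key] = []
        elif key is not None and line.strip():
            blocks[key].append(line.strip())
    return blocks

text = "[1]\n-75.4532 75.8273\n[2]\n-70.00 73.75\n-100.00 69.3333\n"
print(split_by_brackets(text))
# {1: ['-75.4532 75.8273'], 2: ['-70.00 73.75', '-100.00 69.3333']}
```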
Hope that helps!