Gnuplot: Iterate over folders - arrays

I have different folders with datasets called e.g.
3-1-1
3-1-2
3-2-1
3-2-2
The first number is fixed; the second and third are elements of a list:
k1values = "1 2"
k2values = "1 2"
I want to do simple operations in my Gnuplot script, e.g. cd into each of the above directories and read a line of a text file. For each folder, the script should cd into it, read a file, cd back out again, and so on.
My first (1) idea was to connect system command and sprintf:
do for[i=1:words(k1values)]{
    do for[j=1:words(k2values)]{
        system sprintf("cd 3-%d-%d", i, j)
        system 'pwd'
        system 'cd ..'
    }
}
With that, the same path is printed every time, so no cd is happening at all.
I also tried system 'cd sprintf("3-%d-%d", i, j)'
Unfortunately, this does not work either.
Error message: sh: 1: Syntax error: "(" unexpected
I also tried concatenating the values into a string and passing it as the path, which doesn't work either:
k1values = "1 2"
k2values = "1 2"
string1 = '3'
do for[i=1:words(k1values)]{
    do for[j=1:words(k2values)]{
        path = sprintf("%s-%d-%d", string1, i, j)
        system sprintf("cd %s", path)
        system 'pwd'
        system 'cd ..'
    }
}
I print the path for testing, but the working directory does not change at all.
Thanks in advance!
Edit: The idea, in pseudocode, is like this:
do for k1
    do for k2
        valueX = <readingCommand>
        make dir "3-k1-k2/Pictures"
        for int i = 0; i<valueX; i++
            set output bla
            plot "3-k1-k2/Data/i.txt" <options>
        end for
    end do for
end do for

Unless there is a reason we don't know about yet, why do you want to change back and forth between the subdirectories?
Why not build the path/filename with a function, load the desired file, and plot the desired lines?
For example, if you have the following directory structure:
CurrentFolder
    3-1-1
        Data.dat
    3-1-2
        Data.dat
    3-2-1
        Data.dat
    3-2-2
        Data.dat
and the following files:
3-1-1/Data.dat
1 1.14
2 1.15
3 1.12
4 1.11
5 1.13
3-1-2/Data.dat
1 1.24
2 1.25
3 1.22
4 1.21
5 1.23
3-2-1/Data.dat
1 2.14
2 2.15
3 2.12
4 2.11
5 2.13
3-2-2/Data.dat
1 2.24
2 2.25
3 2.22
4 2.21
5 2.23
The following example loads all the files Data.dat from the corresponding subdirectories and plots the lines 2 to 4 (the lines have 0-based index, check help every).
Script:
### plot specific lines from files from different directories
reset session
k1values = "1 2"
k2values = "1 2"
string1 = '3'
myPath(i,j) = sprintf("%s-%s-%s",string1,word(k1values,i),word(k2values,j))
myFile(i,j) = sprintf("%s/%s",myPath(i,j),"Data.dat")
set key out
plot for [i=1:words(k1values)] for[j=1:words(k2values)] myFile(i,j) \
u 1:2 every ::1::3 w lp pt 7 ti myPath(i,j)
### end of script

This is my final solution:
k1values = '0.5 1'
k2values = '0.5 1'
omega = 3
do for[i in k1values]{
    do for[j in k2values]{
        savingPoint = system('head -n 1 "3-'.i.'-'.j.'/<fileName>.dat" | tail -1')
        number = savingPoint/<value>
        do for[m = savingPoint:0:-<value>]{
            set title <...>
            set output <...>
            plot ''.omega.'-'.i.'-'.j.'/Data/'.m.'.txt' <...>
        }
    }
}
<...> is a placeholder and irrelevant.
So this is how I finally iterate over the folders.
Within the second for loop, a reading command is executed and allocated to a variable which is needed in the third for loop. i and j are strings though, but that does not matter.
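A likely explanation for why the cd attempts in the question had no effect: each gnuplot system call starts a separate child shell, so a cd inside it ends with that shell and never changes gnuplot's own working directory (gnuplot has its own built-in cd and pwd commands for that). The Python sketch below is only an analogy, not gnuplot code; it shows the same behavior with a child shell. The function name cwd_after_child_cd is made up for illustration.

```python
import os
import subprocess
import tempfile


def cwd_after_child_cd(target):
    """Run `cd target` in a child shell, then report the parent's cwd."""
    # The cd happens inside the child shell only and is lost when it exits.
    subprocess.run(f'cd "{target}"', shell=True, check=False)
    return os.getcwd()


before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    after = cwd_after_child_cd(tmp)

# The parent process is still where it started.
print("unchanged:", before == after)
```

This is why building the full path (as in the answers above) avoids the problem: nothing depends on the current directory.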


awk calculate euclidean distance results in wrong output

I have this small geo location dataset.
37.9636140,23.7261360
37.9440840,23.7001760
37.9637190,23.7258230
37.9901450,23.7298770
I also have a reference location, for example 37.97570, 23.66721.
I need a bash command using awk that returns the simple Euclidean distance from that location to each point in the dataset.
This is the command I use:
awk -v OFMT=%.17g -F',' -v long=37.97570 -v lat=23.66721 '{for (i=1;i<=NR;i++) distances[i]=sqrt(($1 - long)^2 + ($2 - lat)^2 ); a[i]=$1; b[i]=$2} END {for (i in distances) print distances[i], a[i], b[i]}' filename
When I run this command I get this weird result, which is not correct. Could someone explain what I am doing wrong?
➜ awk -v OFMT=%.17g -F',' -v long=37.97570 -v lat=23.66721 '{for (i=1;i<=NR;i++) distances[i]=sqrt(($1 - long)^2 + ($2 - lat)^2 ); a[i]=$1; b[i]=$2} END {for (i in distances) print distances[i], a[i], b[i]}' filename
44,746962127881936 37.9440840 23.7001760
44,746962127881936 37.9901450 23.7298770
44,746962127881936 37.9636140 23.7261360
44,746962127881936
44,746962127881936 37.9637190 23.7258230
Update:
I appended the command that @jas provided and included the od -c output, as @mark-fuso suggested.
The issue now is that I get different results from @jas.
Command output showing the new issue:
awk -v OFMT=%.17g -F, -v long=37.97570 -v lat=23.66721 '
{distance=sqrt(($1 - long)^2 + ($2 - lat)^2 ); print distance, $1, $2}
' file
1,1820150904705098 37.9636140 23.7261360
1,1820150904705098 37.9440840 23.7001760
1,1820150904705098 37.9637190 23.7258230
1,1820150904705098 37.9901450 23.7298770
od -c that shows the content of the input file.
od -c file
0000000 3 7 . 9 6 3 6 1 4 0 , 2 3 . 7 2
0000020 6 1 3 6 0 \n 3 7 . 9 4 4 0 8 4 0
0000040 , 2 3 . 7 0 0 1 7 6 0 \n 3 7 . 9
0000060 6 3 7 1 9 0 , 2 3 . 7 2 5 8 2 3
0000100 0 \n 3 7 . 9 9 0 1 4 5 0 , 2 3 .
0000120 7 2 9 8 7 7 0 \n
0000130
While @jas has provided a 'fix' for the problem, I thought I'd throw in a few comments about what OP's code is doing ...
Some basics ...
the awk program ({for (i=1;i<=NR;i++) ... ; b[i]=$2}) is applied against each row of the input file
as each row is read from the input file the awk variable NR keeps track of the row number (ie, NR=1 for the first row, NR=2 for the second row, etc)
on the last pass through the for loop the counter (i in this case) will have a value of NR+1 (ie, the i++ is applied on the last pass through the loop thus leaving i=NR+1)
unless there are conditional checks for each line of input the awk program will apply against every line from the input file (including blank lines - more on this below)
for (i in distances)... isn't guaranteed to process the array indices in numerical order
The awk/for loop is doing the following:
for the 1st input row (NR=1) we get for (i=1;i<=1;i++) ...
for the 2nd input row (NR=2) we get for (i=1;i<=2;i++) ...
for the 3rd input row (NR=3) we get for (i=1;i<=3;i++) ...
for the 4th input row (NR=4) we get for (i=1;i<=4;i++) ...
For each row processed by awk the program will overwrite all previous entries in the distances[] array; net result is the last row (NR=4) will place the same value in all 4 entries of the distances[] array.
The a[i]=$1; b[i]=$2 array assignments occur outside the scope of the for loop, so these will be assigned once per input row (ie, will not be overwritten); however, the array assignments are being made with i=NR+1; net result is the contents of the 1st row (NR=1) are stored in array entries a[2] and b[2], the contents of the 2nd row (NR=2) are stored in array entries a[3] and b[3], etc.
Modifying OP's code with print i, distances[i], a[i], b[i]} and running against the 4-line input file I get:
1 0.064310270672728084 # no data for 2nd/3rd columns because a[1] and b[1] are never set
2 0.064310270672728084 37.9636140 23.7261360 # 2nd/3rd columns are from 1st row of input
3 0.064310270672728084 37.9440840 23.7001760 # 2nd/3rd columns are from 2nd row of input
4 0.064310270672728084 37.9637190 23.7258230 # 2nd/3rd columns are from 3rd row of input
From this we can see the first column of output is the same (ie, distances[1]=distances[2]=distances[3]=distances[4]), while the 2nd and 3rd columns are the same as the input columns except they are shifted 'down' by one row.
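The behavior described above can be reproduced outside awk. The following Python sketch is a rough simulation of OP's loop, not a translation awk itself would produce; the variable names (rows, long_, nr) are chosen here for illustration, and the reference point and data are taken from the question.

```python
import math

rows = [
    (37.9636140, 23.7261360),
    (37.9440840, 23.7001760),
    (37.9637190, 23.7258230),
    (37.9901450, 23.7298770),
]
long_, lat = 37.97570, 23.66721

distances, a, b = {}, {}, {}
for nr, (f1, f2) in enumerate(rows, start=1):   # nr plays the role of NR
    i = 1
    while i <= nr:                              # for (i=1; i<=NR; i++)
        # Every entry 1..NR is overwritten with the CURRENT row's distance.
        distances[i] = math.sqrt((f1 - long_) ** 2 + (f2 - lat) ** 2)
        i += 1
    # The loop has finished, so i == NR + 1 at this point.
    a[i], b[i] = f1, f2

for i in sorted(distances):
    print(i, distances[i], a.get(i, ""), b.get(i, ""))
```

All four distances come out identical (the value for the last row), a[1] and b[1] are never set, and the fourth row's coordinates land in a[5] and b[5], which this loop never prints.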
That leaves us with two outstanding issues ...
why does OP show 5 lines of output?
why does the first column consist of the garbage value 44,746962127881936?
I was able to reproduce this issue by adding a blank line on the end of my input file:
$ cat geo.dat
37.9636140,23.7261360
37.9440840,23.7001760
37.9637190,23.7258230
37.9901450,23.7298770
<<=== blank line !!
Which generates the following with OP's awk code:
44.746962127881936
44.746962127881936 37.9636140 23.7261360
44.746962127881936 37.9440840 23.7001760
44.746962127881936 37.9637190 23.7258230
44.746962127881936 37.9901450 23.7298770
NOTES:
this order is different from OP's sample output and is likely due to OP's awk version not processing for (i in distances)... in numerical order; OP can try something like for (i=1;i<=NR;i++)... or for (i=1;i in distances; i++)... (though the latter will not work correctly for a sparsely populated array)
OP's output (in the question and in a comment on @jas's answer) shows a comma (,) in place of the period (.) for the first column, so I'm guessing OP's env is using a locale that switches the comma/period as thousands/decimal delimiter (though the input data is based on an 'opposite' locale)
Notice we finally get to see the data from the 4th line of input (shifted 'down' and displayed on line 5) but the first column has what appears to be a nonsensical value ... which can be tracked back to applying the following against a blank line:
sqrt(($1 - long)^2 + ($2 - lat)^2 )
sqrt(( - long)^2 + ( - lat)^2 ) # empty line => $1 = $2 = undefined/empty
sqrt(( - 37.97570)^2 + ( - 23.66721)^2 )
sqrt( 1442.153790 + 560.136829 )
sqrt( 2002.290619 )
44.746962... # contents of 1st column
To 'fix' this issue the OP can either a) remove the blank line from the input file or b) add some logic to the awk script to only perform calculations if the input line has (numeric) values in fields #1 & #2 (ie, $1 and $2 are not empty); it's up to the coder to decide on how much validation to apply (eg, are the fields numeric, are the fields within the bounds of legitimate long/lat values, etc).
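Option b) can be sketched as follows. This is a minimal Python illustration of the validation idea, not the awk fix itself; the function name distances_from and its default reference point (taken from the question) are assumptions made for the example. It skips blank lines and lines whose two fields are not numeric.

```python
import math


def distances_from(lines, ref_long=37.97570, ref_lat=23.66721):
    """Return (distance, field1, field2) for each valid 'x,y' line."""
    results = []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 2:
            continue                      # blank or malformed line
        try:
            x, y = float(fields[0]), float(fields[1])
        except ValueError:
            continue                      # a field is not numeric
        results.append((math.hypot(x - ref_long, y - ref_lat), x, y))
    return results


sample = [
    "37.9636140,23.7261360\n",
    "37.9440840,23.7001760\n",
    "37.9637190,23.7258230\n",
    "37.9901450,23.7298770\n",
    "\n",                                 # trailing blank line, as in geo.dat
]
for d, x, y in distances_from(sample):
    print(d, x, y)
```

With the blank line skipped, only four rows are reported and the 44.7... value no longer appears.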
One last design-related comment ... as demonstrated in jas' answer there is no need for any of the arrays (which in turn reduces memory usage) when all desired output can be generated 'on-the-fly' while processing each line of the input file.
Awk takes care of the looping for you. The code will be run in turn for each line of the input file:
$ awk -v OFMT=%.17g -F, -v long=37.97570 -v lat=23.66721 '
{distance=sqrt(($1 - long)^2 + ($2 - lat)^2 ); print distance, $1, $2}
' file
0.060152679674309095 37.9636140 23.7261360
0.045676346307474212 37.9440840 23.7001760
0.059824979147508742 37.9637190 23.7258230
0.064310270672728084 37.9901450 23.7298770
EDIT:
OP is getting different results. I notice in OP's output that there are commas instead of decimal points when printing the distance. This points to a possible issue with the locale setting.
OP confirms that the locale was set to Greek, causing the difference in output.

Open file with formatted variable name in Julia

I have a list of files numbered gll_01.tab, gll_02.tab, ...., gll_20.tab in a subdirectory of my parent directory. These files are tabular data files.
I want to open/read files with user-specified input.
I can do:
a = 3
open("directory/gll_0$a.tab")
But using this approach, I would have to define two separate variable names for (01 to 09) and for (10 to 18). How can I use variables or strings with names like 02, 03, ..., etc.?
In Python, I can use an equivalent command:
a = 4
g = '{:02d}'.format(a)
f = open('directory/gll_%s.tab' %g)
Is there an equivalent string formatting command in Julia?
A simple answer in this case would be to use lpad:
a = 3
open("directory/gll_$(lpad(a,2,"0")).tab")
If you need more fancy formatting you can use e.g. https://github.com/JuliaIO/Formatting.jl, in this case this would be:
using Formatting
a = 3
open("directory/gll_$(fmt("0>2", a)).tab")
Another option is to use the @sprintf macro. With that you can use %02d as a formatting option, which pads an integer to width 2 with leading 0s:
julia> using Printf # this is in the standard library
julia> @sprintf("directory/gll_%02d.tab", 1)
"directory/gll_01.tab"
You can use this in your open statements too. Here they are in action:
julia> for i in 5:10
           println("$i file is: $(@sprintf("directory/gll_%02d.tab",i))")
       end
5 file is: directory/gll_05.tab
6 file is: directory/gll_06.tab
7 file is: directory/gll_07.tab
8 file is: directory/gll_08.tab
9 file is: directory/gll_09.tab
10 file is: directory/gll_10.tab

Print words from the corresponding line numbers

Hello everyone,
I have two files, File1 and File2, which have the following data.
File1:
TOPIC:topic_0 30063951.0
2 19195200.0
1 7586580.0
3 2622580.0
TOPIC:topic_1 17201790.0
1 15428200.0
2 917930.0
10 670854.0
and so on. There are 15 topics, and each topic has its respective weights. The numbers in the first column (2, 1, 3, ...) are line numbers that have corresponding words in File2. For example,
File 2 has:
1 i
2 new
3 percent
4 people
5 year
6 two
7 million
8 president
9 last
10 government
and so on. There are about 10,470 lines of words. In short, I want the corresponding words in the first column of File1 instead of the line numbers. My output should look like:
TOPIC:topic_0 30063951.0
new 19195200.0
i 7586580.0
percent 2622580.0
TOPIC:topic_1 17201790.0
i 15428200.0
new 917930.0
government 670854.0
My Code:
import sys
d1 = {}
n = 1
with open("ap_vocab.txt") as in_file2:
    for line2 in in_file2:
        #print n, line2
        d1[n] = line2[:-1]
        n = n + 1
with open("ap_top_t15.txt") as in_file:
    for line1 in in_file:
        columns = line1.split(' ')
        firstwords = columns[0]
        #print firstwords[:-8]
        if firstwords[:-8] == 'TOPIC':
            print columns[0], columns[1]
        elif firstwords[:-8] != '\n':
            num = columns[0]
            print d1[n], columns[1]
This code runs when I use print d1[2], columns[1] instead, giving the second word in File2 for all the lines. But when the code above is run, it gives an error:
KeyError: 10472
There are 10472 lines of words in File2. Please help me with what I should do to rectify this. Thanks in advance!
In your first for loop, n is incremented with each line until reaching a final value of 10472. You are only setting values for d1[n] up to 10471 however, as you have placed the increment after you set d1 for your given n, with these two lines:
d1[n] = line2[:-1]
n = n + 1
Then on the line
print d1[n], columns[1]
in your second for loop (for in_file), you are attempting to access d1[10472], which evidently doesn't exist. Furthermore, you are defining d1 as an empty Dictionary, and then attempting to access it as if it were a list, such that even if you fix your increment you will not be able to access it like that. You must either use a list with d1 = [], or will have to implement an OrderedDict so that you can access the "last" key as dictionaries are typically unordered in Python.
You can either:
Alter your increment so that you do set a value for d1 in the d1[10472] position, or simply set the value for the last position after your for loop.
Depending on what you are attempting to print out, and once d1 is a list rather than a dictionary, you could replace your last line with
print d1[-1], columns[1]
to print out the value for the final index position you currently have set.
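For what it's worth, a dictionary with integer keys can be indexed as d1[n]; the KeyError arises because n is left at 10472 after the first loop, one past the last key that was set. If the goal is the output shown in the question, the lookup key would presumably be the number in the first column of each line rather than n. Below is a minimal Python 3 sketch under that assumption (and assuming the vocabulary file holds one word per line); replace_ids and the sample lists are hypothetical names used only for illustration.

```python
def replace_ids(topic_lines, vocab_lines):
    """Swap the leading line number on each weight line for its word."""
    # Map 1-based line number -> word.
    vocab = {n: line.strip() for n, line in enumerate(vocab_lines, start=1)}
    out = []
    for line in topic_lines:
        columns = line.split()
        if not columns:
            continue                          # skip blank lines
        if columns[0].startswith("TOPIC"):
            out.append(" ".join(columns))     # topic header: keep as is
        else:
            word = vocab[int(columns[0])]     # look up by the number read
            out.append(f"{word} {columns[1]}")
    return out


vocab_sample = ["i", "new", "percent", "people", "year",
                "two", "million", "president", "last", "government"]
topics_sample = [
    "TOPIC:topic_0 30063951.0",
    "2 19195200.0",
    "1 7586580.0",
    "3 2622580.0",
    "TOPIC:topic_1 17201790.0",
    "1 15428200.0",
    "2 917930.0",
    "10 670854.0",
]
for row in replace_ids(topics_sample, vocab_sample):
    print(row)
```

With real files, the two lists would be replaced by the open file objects for ap_top_t15.txt and ap_vocab.txt.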

Awk - Separate one .txt file to files by condition

I have a problem: I would like to split one file into several files based on a condition.
INPUT: One text file
variable chrom=chr1
1000 10
1010 20
1020 10
variable chrom=chr2
1000 20
1100 30
1200 10
OUTPUT: two files for this example.
chr1.txt
variable chrom=chr1
1000 10
1010 20
1020 10
chr2.txt
variable chrom=chr2
1000 20
1100 30
1200 10
So the separator condition is: if a row contains chrom=chr$i (i={1..22}), start writing to a new text file.
Thank you
Something along these lines:
awk 'BEGIN { filename="unknown.txt" } /^variable chrom=/ { close(filename); filename = substr($0, index($0, "=") + 1) ".txt"; } { print > filename }'
Where the awk code is
BEGIN { filename="unknown.txt" }   # default file name, used only if the
                                   # file doesn't start with a variable chrom=
                                   # line
/^variable chrom=/ {               # in such a line:
  close(filename)                  # close the previous file (if open)
                                   # and set the new filename
  filename = substr($0, index($0, "=") + 1) ".txt"
}
{ print > filename }               # print everything to the current file.
The basic algorithm is very straightforward: Read file linewise, change filename when you find a line that starts a new section, always print the current line to the current file, so the devil is in the detail of isolating the file name from the marker line. The
filename = substr($0, index($0, "=") + 1) ".txt"
approach is simplistic but serviceable for the example you showed: It takes everything after the = and attaches .txt to get the file name. If your marker lines are more complicated than variable chrom=filenamestub, this will have to be amended, but in that case I could only guess your requirements and would probably guess wrong.
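For comparison, the same algorithm can be sketched in Python. This is an illustrative port under the same assumptions as the awk version (marker lines start with variable chrom=, and everything after the = is the file name stub); split_by_marker and its parameters are hypothetical names, not part of any library.

```python
import os
import tempfile


def split_by_marker(lines, out_dir, marker="variable chrom="):
    """Write each section to <stub>.txt, switching files at marker lines."""
    out = None
    filename = "unknown.txt"          # used only if no marker line comes first
    try:
        for line in lines:
            if line.startswith(marker):
                if out is not None:   # close the previous section's file
                    out.close()
                    out = None
                filename = line.split("=", 1)[1].strip() + ".txt"
            if out is None:
                out = open(os.path.join(out_dir, filename), "w")
            out.write(line.rstrip("\n") + "\n")
    finally:
        if out is not None:
            out.close()


sample = [
    "variable chrom=chr1", "1000 10", "1010 20", "1020 10",
    "variable chrom=chr2", "1000 20", "1100 30", "1200 10",
]
with tempfile.TemporaryDirectory() as out_dir:
    split_by_marker(sample, out_dir)
    print(sorted(os.listdir(out_dir)))
```

As with print > filename in awk, a stub that appears twice would overwrite the earlier file; opening in append mode would be one way to change that.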
If you know how many lines there are between, you could use
split -l 4 textfile.txt
This will split the textfile every 4th line it finds, making the files xaa and xab, and so on.

Insert different prefixes to alternating lines of a file

I'm having trouble creating 2 commands that insert one word (different between the commands) at the beginning of every line with step = 2.
For example:
Before:
10
10
10
10
After:
group1 10
group2 10
group1 10
group2 10
So what I want is for one command to insert the word 'group1' at the start of every odd line, while the second command inserts the word 'group2' at the start of every even line.
The number 10 is chosen randomly as a substitute for my data numbers
Hope you could help me with this.
Cheers,
You can do this with sed, here handling odd and even lines separately:
sed '1~2 s/^/group1 /' original.txt | sed '2~2 s/^/group2 /' >modified.txt
The 1~2 address matches every second line starting with the first, and 2~2 matches every second line starting with the second (the first~step address form is a GNU sed extension). "s" substitutes, and "^" matches the start of the line.
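Where GNU sed is not available, the same odd/even logic is easy to express elsewhere. The following Python sketch is one hedged alternative, not a replacement for the sed answer; prefix_alternating and its parameter names are invented for the example.

```python
def prefix_alternating(lines, odd="group1", even="group2"):
    """Prefix odd-numbered lines with `odd` and even-numbered with `even`."""
    return [
        f"{odd if number % 2 else even} {line}"
        for number, line in enumerate(lines, start=1)   # 1-based like sed
    ]


for row in prefix_alternating(["10", "10", "10", "10"]):
    print(row)
```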
