Print words from the corresponding line numbers - file

Hello Everyone,
I have two files File1 and File2 which has the following data.
File1:
TOPIC:topic_0 30063951.0
2 19195200.0
1 7586580.0
3 2622580.0
TOPIC:topic_1 17201790.0
1 15428200.0
2 917930.0
10 670854.0
and so on..There are 15 topics and each topic have their respective weights. And the first column like 2,1,3 are the numbers which have corresponding words in file2. For example,
File 2 has:
1 i
2 new
3 percent
4 people
5 year
6 two
7 million
8 president
9 last
10 government
and so on.. There are about 10,470 lines of words. So, in short I should have the corresponding words in the first column of file1 instead of the line numbers. My output should be like:
TOPIC:topic_0 30063951.0
new 19195200.0
i 7586580.0
percent 2622580.0
TOPIC:topic_1 17201790.0
i 15428200.0
new 917930.0
government 670854.0
My Code:
import sys
d1 = {}
n = 1
with open("ap_vocab.txt") as in_file2:
for line2 in in_file2:
#print n, line2
d1[n] = line2[:-1]
n = n + 1
with open("ap_top_t15.txt") as in_file:
for line1 in in_file:
columns = line1.split(' ')
firstwords = columns[0]
#print firstwords[:-8]
if firstwords[:-8] == 'TOPIC':
print columns[0], columns[1]
elif firstwords[:-8] != '\n':
num = columns[0]
print d1[n], columns[1]
This code is running when I type print d1[2], columns[1] giving the second word in file2 for all the lines. But when the above code is printed, it is giving an error
KeyError: 10472
there are 10472 lines of words in the file2. Please help me with what I should do to rectify this. Thanks in advance!

In your first for loop, n is incremented with each line until reaching a final value of 10472. You are only setting values for d1[n] up to 10471 however, as you have placed the increment after you set d1 for your given n, with these two lines:
d1[n] = line2[:-1]
n = n + 1
Then on the line
print d1[n], columns[1]
in your second for loop (for in_file), you are attempting to access d1[10472], which evidently doesn't exist. Furthermore, you are defining d1 as an empty Dictionary, and then attempting to access it as if it were a list, such that even if you fix your increment you will not be able to access it like that. You must either use a list with d1 = [], or will have to implement an OrderedDict so that you can access the "last" key as dictionaries are typically unordered in Python.
You can either:
Alter your increment so that you do set a value for d1 in the d1[10472] position, or simply set the value for the last position after your for loop.
Depending on what you are attempting to print out, you could replace your last line with
print d1[-1], columns[1]
to print out the value for the final index position you currently have set.

Related

awk calculate euclidean distance results in wrong output

I have this small geo location dataset.
37.9636140,23.7261360
37.9440840,23.7001760
37.9637190,23.7258230
37.9901450,23.7298770
From a random location.
For example this one 37.97570, 23.66721
I need to create a bash command with awk that returns the distances with simple euclidean distance.
This is the command i use
awk -v OFMT=%.17g -F',' -v long=37.97570 -v lat=23.66721 '{for (i=1;i<=NR;i++) distances[i]=sqrt(($1 - long)^2 + ($2 - lat)^2 ); a[i]=$1; b[i]=$2} END {for (i in distances) print distances[i], a[i], b[i]}' filename
When I run this command i get this weird result which is not correct, could someone explain to me what am I doing wrong?
➜ awk -v OFMT=%.17g -F',' -v long=37.97570 -v lat=23.66721 '{for (i=1;i<=NR;i++) distances[i]=sqrt(($1 - long)^2 + ($2 - lat)^2 ); a[i]=$1; b[i]=$2} END {for (i in distances) print distances[i], a[i], b[i]}' filename
44,746962127881936 37.9440840 23.7001760
44,746962127881936 37.9901450 23.7298770
44,746962127881936 37.9636140 23.7261360
44,746962127881936
44,746962127881936 37.9637190 23.7258230
Updated.
Appended the command that #jas provided, I included od -c as #mark-fuso suggetsted.
The issue now is that I get different results from #jas
Command output which showcases the new issue.
awk -v OFMT=%.17g -F, -v long=37.97570 -v lat=23.66721 '
{distance=sqrt(($1 - long)^2 + ($2 - lat)^2 ); print distance, $1, $2}
' file
1,1820150904705098 37.9636140 23.7261360
1,1820150904705098 37.9440840 23.7001760
1,1820150904705098 37.9637190 23.7258230
1,1820150904705098 37.9901450 23.7298770
od -c that shows the content of the input file.
od -c file
0000000 3 7 . 9 6 3 6 1 4 0 , 2 3 . 7 2
0000020 6 1 3 6 0 \n 3 7 . 9 4 4 0 8 4 0
0000040 , 2 3 . 7 0 0 1 7 6 0 \n 3 7 . 9
0000060 6 3 7 1 9 0 , 2 3 . 7 2 5 8 2 3
0000100 0 \n 3 7 . 9 9 0 1 4 5 0 , 2 3 .
0000120 7 2 9 8 7 7 0 \n
0000130
While #jas has provided a 'fix' for the problem, thought I'd throw in a few comments about what OP's code is doing ...
Some basics ...
the awk program ({for (i=1;i<=NR;i++) ... ; b[i]=$2}) is applied against each row of the input file
as each row is read from the input file the awk variable NR keeps track of the row number (ie, NR=1 for the first row, NR=2 for the second row, etc)
on the last pass through the for loop the counter (i in this case) will have a value of NR+1 (ie, the i++ is applied on the last pass through the loop thus leaving i=NR+1)
unless there are conditional checks for each line of input the awk program will apply against every line from the input file (including blank lines - more on this below)
for (i in distances)... isn't guaranteed to process the array indices in numerical order
The awk/for loop is doing the following:
for the 1st input row (NR=1) we get for (i=1;i<=1;i++) ...
for the 2nd input row (NR=2) we get for (i=1;i<=2;i++) ...
for the 3rd input row (NR=3) we get for (i=1;i<=3;i++) ...
for the 4th input row (NR=4) we get for (i=1;i<=4;i++) ...
For each row processed by awk the program will overwrite all previous entries in the distance[] array; net result is the last row (NR=4) will place the same values in all 4 entries of the the distance[] array.
The a[i]=$1; b[i]=$2 array assignments occur outside the scope of the for loop so these will be assigned once per input row (ie, will not be overwritten) however, the array assignments are being made with i=NR+1; net result is the contents of the 1st row (NR=1) are stored in array entries a[2] and b[2], the contents of the 2nd row (NR=2) are stored in array entries a[3] and a[3], etc.
Modifying OP's code with print i, distances[i], a[i], b[i]} and running against the 4-line input file I get:
1 0.064310270672728084 # no data for 2nd/3rd columns because a[1] and b[1] are never set
2 0.064310270672728084 37.9636140 23.7261360 # 2nd/3rd columns are from 1st row of input
3 0.064310270672728084 37.9440840 23.7001760 # 2nd/3rd columns are from 2nd row of input
4 0.064310270672728084 37.9637190 23.7258230 # 2nd/3rd columns are from 3rd row of input
From this we can see the first column of output is the same (ie, distance[1]=distance[2]=distance[3]=distance[4]), while the 2nd and 3rd columns are the same as the input columns except they are shifted 'down' by one row.
That leaves us with two outstanding issues ...
why does OP show 5 lines of output?
why is the first column consist of the garbage 44,746962127881936?
I was able to reproduce this issue by adding a blank line on the end of my input file:
$ cat geo.dat
37.9636140,23.7261360
37.9440840,23.7001760
37.9637190,23.7258230
37.9901450,23.7298770
<<=== blank line !!
Which generates the following with OP's awk code:
44.746962127881936
44.746962127881936 37.9636140 23.7261360
44.746962127881936 37.9440840 23.7001760
44.746962127881936 37.9637190 23.7258230
44.746962127881936 37.9901450 23.7298770
NOTES:
this order is different from OP's sample output and is likely due to OP's awk version not processing for (i in distances)... in numerical order; OP can try something like for (i=1;i<=NR;i++)... or for (i=1;i in distances; i++)... (though the latter will not work correcly for a sparsely populated array)
OPs output (in the question; in comment to #jas' answer) shows a comma (,) in place of the period (.) for the first column so I'm guessing OP's env is using a locale that switches the comma/period as thousands/decimal delimiter (though the input data is based on an 'opposite' locale)
Notice we finally get to see the data from the 4th line of input (shifted 'down' and displayed on line 5) but the first column has what appears to be a nonsensical value ... which can be tracked back to applying the following against a blank line:
sqrt(($1 - long)^2 + ($2 - lat)^2 )
sqrt(( - long)^2 + ( - lat)^2 ) # empty line => $1 = $2 = undefined/empty
sqrt(( - 37.97570)^2 + ( - 23.66721^2 )
sqrt( 1442.153790 + 560.136829 )
sqrt( 2002.290619 )
44.746952... # contents of 1st column
To 'fix' this issue the OP can either a) remove the blank line from the input file or b) add some logic to the awk script to only perform calculations if the input line has (numeric) values in fields #1 & #2 (ie, $1 and $2 are not empty); it's up to the coder to decide on how much validation to apply (eg, are the fields numeric, are the fields within the bounds of legitimate long/lat values, etc).
One last design-related comment ... as demonstrated in jas' answer there is no need for any of the arrays (which in turn reduces memory usage) when all desired output can generated 'on-the-fly' while processing each line of the input file.
Awk takes care of the looping for you. The code will be run in turn for each line of the input file:
$ awk -v OFMT=%.17g -F, -v long=37.97570 -v lat=23.66721 '
{distance=sqrt(($1 - long)^2 + ($2 - lat)^2 ); print distance, $1, $2}
' file
0.060152679674309095 37.9636140 23.7261360
0.045676346307474212 37.9440840 23.7001760
0.059824979147508742 37.9637190 23.7258230
0.064310270672728084 37.9901450 23.7298770
EDIT:
OP is getting different results. I notice in OP's output that there are commas instead of decimal points when printing the distance. This points to a possible issue with the locale setting.
OP confirms that the locale was set for greek, causing the difference in output.

COBOL85: How to find number of rows in an array dynamically

In my program I keep filling the following array with data obtained from a database table then inspect it to find certain words:
01 PRODUCTS-TABLE.
03 PRODUCT-LINE PIC X(40) OCCURS 50 TIMES.
sometimes it occurs 6 times, sometimes more than 6 times.
I'd like to find the number of lines in the array every time I write data to it , how can I do that ?
I tried this but it based on a fixed length:
INSPECT-PROCESS.
MOVE 0 TO TALLY-1.
INSPECT PRODUCTS-TABLE TALLYING TALLY-1 FOR ALL "PRODUCT"
IF TALLY-1 > 0
MOVE SER-NUMBER TO HITS-SN-OUTPUT
MOVE FILLER-SYM TO FILLER-O
MOVE PRODUCT-LINE(1) TO HITS-PR-OUTPUT
WRITE HITS-REC
PERFORM WRITE-REPORT VARYING CNT1 FROM 2 BY 1 UNTIL CNT1 = 11.
WRITE-REPORT.
MOVE " " TO HITS-SN-OUTPUT
MOVE PRODUCT-LINE(CNT1) TO HITS-TX-OUTPUT
WRITE HITS-REC.
In the first output line it writes the SN and the first product-line then in the following lines it writes all remaining product-line and blank out SN.
Something like:
12345678 first product-line
Second product-line
etc
It’s working, however, it only stops when CNT1 is 11, how can I feed the procedure with a variable CNT1 based on how many lines are actually in PRODUCTS-TABLE each time?
I solved the problem by adding an array line counter (LINE-COUNTER-1) to count (ADD 1 TO LINE-COUNTER-1) how many times I add data to the array and stop writing the report when "WRITE-COUNTER = LINE-COUNTER-1"
INSPECT-PROCESS.
MOVE 0 TO TALLY-1
INSPECT PRODUCTS-TABLE TALLYING TALLY-1 FOR ALL "PRODUCT"
IF TALLY-1 > 0
MOVE HOLD-SER-NUM TO HITS-SN-OUTPUT
MOVE FILLER-SYM TO FILLER-O
MOVE PRODUCT-LINE(1) TO HITS-PR-OUTPUT
WRITE HITS-REC
PERFORM WRITE-REPORT VARYING WRITE-COUNTER FROM 2 BY 1
UNTIL WRITE-COUNTER = LINE-COUNTER-1.

Find a list of max values in a text file using awk

I am new to awk and I cannot figure out the correct syntax for the task I am working on.
I have a text file which looks something like this (the content is always sorted but is not always the same, so I cannot hard code the index of the array):
27 abc123
27 abd333
27 dce123
23 adb234
21 abc789
18 bcd213
So apparently the max is 27. However, I want my output to be:
27 abc123
27 abd333
27 dce123
and not the first row only.
The second column is just there, my code always sorts the text file based on the first column.
My code right now set the max as the first value (27 for example), and as it reads through the lines, it stores only the rows with the max values in an array and eventually print out the output.
awk 'BEGIN {max=$1} {if(($1)==max) a[NR]=($0)} END {for (i in a) print a[i]}' file
You can't read fields in a BEGIN block, since it's executed before the file is read.
To find the first record, use the pattern NR == 1. NR is the number of the current record. To find the other records, just check whether $1 equals the max value.
NR == 1 { max = $1 }
$1 == max { print }
Since your input is always sorted, you can optimise this program by exiting after reading all the records with the max value:
$1 != max { exit }

Read from txt file to matrix after a specific expression

I wanted via matlab to read a table of data from a txt file after a specific expression and a number of non desired lines for example the AA.txt have:
Information about students :
AAAA
BBBB
1 10 100
2 3 15
! ! ! a number of lines
10 6 9
I have like information the expression 'Information about students', the number of skipped lines 2 and the number of columns 3 and rows 10 in desired matrix.
if I understand correctly, you wanna skip the first 3 lines (assuming them as headers) and then reading the rest.
I would follow this procedure:
fid = fopen(filename,'r');
A = textscan(fid,'%f %f %f','HeaderLines',3,'Delimiter','\r\n');
I currently do not have access to MATLAB, but I do believe it will work.

Compare file with variable list AWK

I'm stumbling over myself trying to get a seemingly simple thing accomplished. I have one file, and one newline delimited string list.
File:
Dat1 Loc1
Dat2 Loc1
Dat3 Loc1
Dat4 Loc2
Dat5 Loc2
My list is something like this:
Dat1
Dat2
Dat3
Dat4
What I am trying to do is compare the list with the data file and count the number of unique Locs that appear. I am interested only in the largest count. In the example above, when comparing the list to the file, I want essentially:
Dat1 MATCHED Loc1Count = 1
Dat2 MATCHED Loc1Count = 2
Dat3 MATCHED Loc1Count = 3
Dat4 MATCHED Loc2Count = 1
Return: Loc1 if Loc1Count/Length of List > 50%
Now,
I know that awk 1 file will read a file line by line. Furthermore I know that "echo "$LIST" | awk '/search for a line that contains this/" will return the line that matches that internal string. I haven't been able to combine these ideas successfully though as nested awks, much less how to count "loc1" vs "loc2" (which, by the way, are going to be random strings, and not form-standard)
I feel like this is simple, but I'm banging my head against a wall. Any ideas? Is this sufficiently clear?
list="Dat1 Dat2 Dat3 Dat4"
awk -vli="$list" 'BEGIN{
# here list from shell is converted to awk array "list".
m=split(li,list," ")
}
{
# go through the list
for(i=1;i<=m;i++){
if($1 == list[i]){
# if Dat? is found in list, print , at the same time
print $1" matched Locount="$2" "++data[$2] # increment the count for $2 and store in loc array
loc[$2]++
}
}
}
END{
# here returns loc1 count
loc1count=loc["Loc1"]
if(( loc1count / m *100 ) > 50) {
print "Loc1 count: "loc1count
}
} ' file
output
$ ./shell.sh
Dat1 matched Locount=Loc1 1
Dat2 matched Locount=Loc1 2
Dat3 matched Locount=Loc1 3
Dat4 matched Locount=Loc2 1
Loc1 count: 3

Resources