cut/edit row lines and then transpose rows to column arrays - arrays

I've greatly simplified the values in the question
I have 500k lines of text in a space-delim file that I need to extract x, y values for plotting from each row and transpose those rows to columns. I'm trying to determine the most efficient way of doing this, and I've most been working in a bash script using sed and awk -- but if there are better ways of doing this, I'm open to suggestions.
The following code block is a simple example of the first 3 lines of text. The first 4 values (a,b,c,num) can be ignored but then each pair of values (separated by 0.000) needs to be cut and placed into a column and I need to do this for every line. The 2, 3 or 2 (right after c) tells us how many x,y pairs follow in the line, starting with 2500 and 500
Sample input:
a b c 2 2500.0 500.0 0.000 0.0 10.0
a b c 3 2000.0 450.0 0.000 1000.0 400.0 0.000 0.0 12.0
a b c 2 1800.0 475.0 0.000 0.0 15.0
Expected output:
2500.0 500.0 2000.0 450.0 1800.0 475.0
0.0 10.0 1000.0 400.0 0.0 15.0
0.0 12.0

I had a flash of inspiration - is this what you're trying to do:
$ cat tst.awk
BEGIN { OFS="\t" }
{
rowNr = 0
for ( i=6; i<NF; i+=3 ) {
val = $i OFS $(i+1)
vals[++rowNr,NR] = val
}
numRows = ( rowNr > numRows ? rowNr : numRows )
}
END {
numCols = NR
for ( rowNr=1; rowNr<=numRows; rowNr++ ) {
for ( colNr=1; colNr<=numCols; colNr++ ) {
val = ( (rowNr,colNr) in vals ? vals[rowNr,colNr] : OFS )
printf "%s%s", val, (colNr<numCols ? OFS : ORS)
}
}
}
$ awk -f tst.awk file
2500.0 500.0 2000.0 450.0 1800.0 475.0
0.0 10.0 1000.0 400.0 0.0 15.0
0.0 12.0
Previous answers:
Using your new example:
$ awk '{for (i=6;i<NF;i+=3) print $i, $(i+1)}' file
2500.0 500.0
0.0 10.0
2000.0 450.0
1000.0 400.0
0.0 12.0
1800.0 475.0
0.0 15.0
Original answer:
$ awk '{for (i=5;i<NF;i+=3) print $i, $(i+1)}' file
2328.948975 287.601868
2091.017578 168.763153
1949.201782 38.351677
1926.619385 43.498230
1808.323120 91.477585
1792.807861 60.784626
1650.532471 25.608397
1630.271484 54.119873
1615.212891 4.026413
1502.170288 118.276688
1176.283447 43.057251
1119.772705 56.983471
944.500000 81.461624
937.491516 107.800484
726.215515 181.033768
641.585510 82.066551
443.907104 27.303604
362.327789 90.082993
348.752777 39.833252
61.434803 44.582367
0.000000 7.181455

Related

Pairwise Cohen's Kappa of rows in DataFrame in Pandas (python)

I'd greatly appreciate some help on this. I'm using jupyter notebook.
I have a dataframe where I want calculate the interrater reliability. I want to compare them pairwise by the value of the ID column (all IDs have a frequency of 2, one for each coder). All ID values represent different articles, so I do not want to compare them all together, but more take the average of the interrater reliability of each pair (and potentially also for each column).
N. ID. A. B.
0 8818313 Yes Yes 1.0 1.0 1.0 1.0 1.0 1.0
1 8818313 Yes No 0.0 1.0 0.0 0.0 1.0 1.0
2 8820105 No Yes 0.0 1.0 1.0 1.0 1.0 1.0
3 8820106 No No 0.0 0.0 0.0 1.0 0.0 0.0
I've been able to find some instructions of the cohen's k, but not of how to do this pairwise by value in the ID column.
Does anyone know how to go about this?
Here is how I will approach it:
from io import StringIO
from sklearn.metrics import cohen_kappa_score
df = pd.read_csv(StringIO("""
N,ID,A,B,Nums
0, 8818313, Yes, Yes,1.0 1.0 1.0 1.0 1.0 1.0
1, 8818313, Yes, No,0.0 1.0 0.0 0.0 1.0 1.0
2, 8820105, No, Yes,0.0 1.0 1.0 1.0 1.0 1.0
3, 8820105, No, No,0.0 0.0 0.0 1.0 0.0 0.0 """))
def kappa(df):
nums1 = [float(num) for num in df.Nums.iloc[0].split(' ') if num]
nums2 = [float(num) for num in df.Nums.iloc[1].split(' ') if num]
return cohen_kappa_score(nums1, nums2)
df.groupby('ID').apply(kappa)
This will generate:
ID
8818313 0.000000
8820105 0.076923
dtype: float64

Use for loop to get i:i+1 columns

I want to use for loop so I can get the result for i:i+1 columns at each iteration. E.g: when i= 1, I get 1st & 2nd cols, i= 2 , 3rd and 4th cols.
m= [2 -3;4 6]
al= [1 3;-2 -4]
l=[1 0; 2 4]
Random.seed!(1234)
d= [rand(2:20, 10) rand(-1:10, 10)]
Random.seed!(1234)
c= [rand(0:30, 10) rand(-1:20, 10)]
n,g=size(w)
mx=zeros(n,2*g)
for i =1:g
mx[ : ,i:i+1] = m[:,i]' .+ (d[:,i]' .* al[:,i])' .+ (l[:,i]' .* c[:,i])
end
return mx
I got the following which is wrong when I compared it with doing the process manually
julia> mx
10×4 Array{Float64,2}:
8.0 -6.0 10.0 0.0
22.0 3.0 18.0 0.0
40.0 18.0 38.0 0.0
9.0 9.0 22.0 0.0
51.0 3.0 18.0 0.0
10.0 18.0 34.0 0.0
52.0 12.0 30.0 0.0
36.0 21.0 42.0 0.0
44.0 24.0 42.0 0.0
33.0 24.0 42.0 0.0
The first 2 cols and last 2 cols in mx should be matched the results here with same order
mx1_2=m[:,1]' .+ (d[:,1]' .* al[:,1])' .+ (l[:,1]' .* c[:,1])
mx2_4=m[:,2]' .+ (d[:,2]' .* al[:,2])' .+ (l[:,2]' .* c[:,2])

Output a for loop into a array

I need the below 4th row idle outputs to be put into a array and then take a average out of the same . below is lparstat output of an aix system.
$ lparstat 2 10
System configuration: type=Shared mode=Uncapped smt=4 lcpu=16 mem=8192MB psize=16 ent=0.20
%user %sys %wait %idle physc %entc lbusy app vcsw phint %nsp %utcyc
----- ----- ------ ------ ----- ----- ------ --- ----- ----- ----- ------
2.6 1.8 0.0 95.5 0.02 9.5 0.0 5.05 270 0 101 1.42
2.8 1.6 0.0 95.6 0.02 9.9 1.9 5.38 258 0 101 1.42
0.5 1.4 0.0 98.1 0.01 5.5 2.9 5.17 265 0 101 1.40
2.8 1.3 0.0 95.8 0.02 8.9 0.0 5.37 255 0 101 1.42
2.8 2.0 0.0 95.2 0.02 10.8 1.9 4.49 264 0 101 1.42
4.2 1.7 0.0 94.1 0.02 12.2 0.0 3.66 257 0 101 1.42
0.5 1.5 0.0 98.0 0.01 6.3 1.9 3.35 267 0 101 1.38
3.1 2.0 0.0 94.9 0.02 12.1 2.9 3.07 367 0 101 1.41
2.3 2.2 0.0 95.5 0.02 9.8 0.0 3.40 259 0 101 1.42
25.1 25.5 0.0 49.4 0.18 89.6 2.6 2.12 395 0 101 1.44
I have made a script like this but need to press enter to get the output .
$ for i in ` lparstat 2 10 | tail -10 | awk '{print $4}'`
> do
> read arr[$i]
> echo arr[$i]
> done
arr[94.0]
arr[97.7]
arr[94.9]
arr[91.0]
arr[98.1]
arr[97.7]
arr[93.0]
arr[94.8]
arr[97.9]
arr[89.2]
Your script only needs a small improvement to calculate the average. You can do that inside awk right away:
lparstat 2 10 | tail -n 10 | awk '{ sum += $4 } END { print sum / NR }'
The tail -n 10 takes 10 last lines.
{ sum += $4 } is calculated for each line - it sums the values at 4th column.
Then END block executes after the whole file is read. The { print sum / NR } prints the average. NR is "Number of Records", where one record is one line, so it's number of lines.
Notes:
backticks ` are discouraged. The modern $( ... ) syntax is much preferred.
The "for i in `cmd`" or more commonly for i in $(...) is a common antipattern in bash. Use while read -r line when reading lines from a command, like cmd | while read -r line; do echo "$line"; done or in bash while read -r line; do echo "$line"; done < <(cmd)

Comparing two columns and summing the values in Matlab

I have 2 columns like this:
0.0 1.2
0.0 2.3
0.0 1.5
0.1 1.0
0.1 1.2
0.1 1.4
0.1 1.7
0.4 1.1
0.4 1.3
0.4 1.5
In the 1st column, 0.0 is repeated 3 times. I want to sum corresponding elements
(1.2 + 2.3 + 1.5) in the 2nd column. Similarly, 0.1 is repeated 4 times in the 1st
column. I want to sum the corresponding elements (1.0 + 1.2 + 1.4 + 1.7) in the 2nd
column and so on.
I am trying like this
for i = 1:length(col1)
for j = 1:length(col2)
% if col2(j) == col1(i)
% to do
end
end
end
This is a classical use of unique and accumarray:
x = [0.0 1.2
0.0 2.3
0.0 1.5
0.1 1.0
0.1 1.2
0.1 1.4
0.1 1.7
0.4 1.1
0.4 1.3
0.4 1.5]; % data
[~, ~, w] = unique(x(:,1)); % labels of unique elements
result = accumarray(w, x(:,2)); % sum using the above as grouping variable
You can also use the newer splitapply function instead of accumarray:
[~, ~, w] = unique(x(:,1)); % labels of unique elements
result = splitapply(#sum, x(:,2), w); % sum using the above as grouping variable
a=[0.0 1.2
0.0 2.3
0.0 1.5
0.1 1.0
0.1 1.2
0.1 1.4
0.1 1.7
0.4 1.1
0.4 1.3
0.4 1.5]
% Get unique col1 values, and indices
[uniq,~,ib]=unique(a(:,1));
% for each unique value in col1
for ii=1:length(uniq)
% sum all col2 values that correspond to the current index of the unique value
s(ii)=sum(a(ib==ii,2));
end
Gives:
s =
5.0000 5.3000 3.9000

Multiple assignment in multidimensional array

I have a 4x4 array of zeros.
julia> X = zeros(4,4)
4x4 Array{Float64,2}:
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
I have an 2xN array containing indices of elements in X that I want to assign a new value.
julia> ind = [1 1; 2 2; 3 3]
3x2 Array{Int64,2}:
1 1
2 2
3 3
What is the simplest way to assign a value to all elements in X whose indices are rows in ind? (something like X[ind] = 2.0).
julia> X
2.0 0.0 0.0 0.0
0.0 2.0 0.0 0.0
0.0 0.0 2.0 0.0
0.0 0.0 0.0 0.0
I'm not sure there is a non-looping way to do this. What's wrong with this?
for i=[1:size(ind)[1]]
a, b = ind[i, :]
X[a, b] = 2.0
end
user3467349's answer is correct, but inefficient, because it allocates an Array for the indices. Also, the notation [a:b] is deprecated as of Julia 0.4. Instead, you can use:
for i = 1:size(ind, 1)
a, b = ind[i, :]
X[a, b] = 2.0
end

Resources