merge / append files and re-number first column in unix - file

I have many (3 is just an example) text files in different directories (3 different names), like the following:
Directory: A, file name: run.txt, format: txt, tab delimited
; file one
10 0.2 0.5 0.3
20 0.1 0.6 0.8
30 0.2 0.1 0.1
40 0.1 0.5 0.3
Directory: B, file name: run.txt, format: txt, tab delimited
; file two
10 0.2 0.1 0.2
30 0.1 0.6 0.8
50 0.2 0.1 0.1
70 0.3 0.4 0.4
Directory: C, file name: run.txt, format: txt, tab delimited
; file three
10 0.3 0.3 0.3
20 0.3 0.6 0.8
30 0.1 0.1 0.1
40 0.2 0.2 0.3
I want to combine all three run.txt files into a single file and renumber the first column. The resulting new file will look like:
; file combined
10 0.2 0.5 0.3
20 0.1 0.6 0.8
30 0.2 0.1 0.1
40 0.1 0.5 0.3
50 0.2 0.1 0.2
70 0.1 0.6 0.8
90 0.2 0.1 0.1
110 0.3 0.4 0.4
120 0.3 0.3 0.3
130 0.3 0.6 0.8
140 0.1 0.1 0.1
150 0.2 0.2 0.3
This is what my code looks like so far:
cat A/run.txt B/run.txt C/run.txt > combined.txt
(1) I do not know how to take care of renumbering the first column
(2) Also, I do not know how to take care of the comment lines starting with ";"
Edit:
Let me be clear about the numbering scheme:
A/run.txt, B/run.txt and C/run.txt are actually parallel runs to be combined into one,
so each one stores samples with a run number. However, the gap can be uneven between runs.
(1) For the first file A/run.txt the gap is 10 (20-10, 30-20):
10, 10+10, 20+10, 30+10
(2) The second file B/run.txt starts from 10 but has a gap of 20
(e.g. 30-10, 50-30, 70-50):
40 (from the last line of the first file) + 10 (first number in file two) = 50,
50 + 20 = 70, 70 + 20 = 90, 90 + 20 = 110
(3) File C/run.txt starts from 10 and the increment is 10:
110 (last number in file 2) + 10 = 120, 120 + 10 = 130,
130 + 10 = 140, 140 + 10 = 150

You could use awk:
awk 'BEGIN{l=0;print "; file combined"}; {if($1!=";")print l,$2,$3,$4;l=l+10}' A/run.txt B/run.txt C/run.txt > combined.txt
EDIT
I made a guess about your numbering scheme (you still haven't provided a spec) and came up with:
awk 'BEGIN{line=0;last=0;print "; file combined"}; !/^;/{if($1<last){line=last+$1}else{line=line+$1-last;last=$1};print line,$2,$3,$4}' \
A/run.txt B/run.txt C/run.txt > combined.txt
Is this what you mean?

#!/usr/bin/awk -f
BEGIN {
    OFS = "\t"
    printf "%s\n", "; file combined"
}
! /^;/ {
    # reset the per-file state on the first data line of each new file
    if (FILENAME != prevfile) {
        prevnum = $1
        prevfile = FILENAME
        interval = 10
        c = 0
    }
    c++
    # the second data line of a file reveals that file's real gap
    if (c == 2) {
        interval = $1 - prevnum
    }
    # keep a running counter across files and renumber the first column
    $1 = (i += interval)
    print
}
To run it:
$ ./renumber {A,B,C}/run.txt
Given your sample input, it produces output that exactly matches your sample.
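For copy-paste convenience, roughly the same logic can be squeezed into a one-liner (a sketch; it is intended to behave like the script above):
awk 'BEGIN{OFS="\t";print "; file combined"} !/^;/{if(FILENAME!=prevfile){prevnum=$1;prevfile=FILENAME;interval=10;c=0};c++;if(c==2)interval=$1-prevnum;$1=(i+=interval);print}' A/run.txt B/run.txt C/run.txt > combined.txt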

awk '{$1="";print NR"0",$0}' A/run.txt B/run.txt C/run.txt > combined.txt

This might work for you:
awk -F'[\t]' 'lastfile!=FILENAME{lastfile=FILENAME;i=l};{$1+=i;l=$1};1' A/run.txt B/run.txt C/run.txt > combined.txt
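The same one-liner broken out with comments, so it is easier to see what each rule does (behaviour is unchanged; note that lines beginning with ';' are not skipped):
awk -F'[\t]' '
    lastfile != FILENAME {   # first line of each new input file
        lastfile = FILENAME
        i = l                # offset = last value written from the previous file
    }
    {
        $1 += i              # shift the first column by the offset
        l = $1               # remember the value just written
    }
    1                        # print every (possibly modified) line
' A/run.txt B/run.txt C/run.txt > combined.txt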

Related

Alternative to multiple padarray calls to get a perimeter mask for image

I have an array of doubles img which I multiply with a mask (mask.*img), where the mask has values of 1 in the middle but goes linearly to 0 at the borders, e.g. for a 5x5 mask it would be something like
0.1 0.1 0.1 0.1 0.1
0.1 0.5 0.5 0.5 0.1
0.1 0.5 1 0.5 0.1
0.1 0.5 0.5 0.5 0.1
0.1 0.1 0.1 0.1 0.1
My idea for this currently is to create the center using x = ones(M)
and then create a sequence of decreasing values y = [0.9 0.5 0.3 0.1]
and then do
for k = 1:numel(y)
    x = padarray(x, [1 1], y(k));
end
which will add the values of y as a perimeter around x multiple times, one layer at a time. Is there a more clever way to create this kind of mask that tapers off at the perimeter?
An interesting way to do something similar might be the following. The vector Taper is the same as the centre row of the 5-by-5 matrix. Each row is generated by taking the element-wise minimum of Taper and the corresponding element of its transpose Taper.'.
Broken down into steps:
Row 1: min([0.1 0.5 1 0.5 0.1],[0.1]); → [0.1 0.1 0.1 0.1 0.1]
Row 2: min([0.1 0.5 1 0.5 0.1],[0.5]); → [0.1 0.5 0.5 0.5 0.1]
Row 3: min([0.1 0.5 1 0.5 0.1],[1]); → [0.1 0.5 1 0.5 0.1]
Row 4: min([0.1 0.5 1 0.5 0.1],[0.5]); → [0.1 0.5 0.5 0.5 0.1]
Row 5: min([0.1 0.5 1 0.5 0.1],[0.1]); → [0.1 0.1 0.1 0.1 0.1]
Taper = [0.1 0.5 1 0.5 0.1];
Result = min(Taper, Taper.');
Result

Merge 2 text files with the same first column

I need to merge these 2 files
File1
1
1
2
2
2
3
4
4
4
File2
1 A 0.2 0.8 0.3
2 B 0.4 0.3 0.2
3 C 0.8 0.9 0.5
4 D 0.6 0.7 0.8
Output should be
1 A 0.2 0.8 0.3
1 A 0.2 0.8 0.3
2 B 0.4 0.3 0.2
2 B 0.4 0.3 0.2
2 B 0.4 0.3 0.2
3 C 0.8 0.9 0.5
4 D 0.6 0.7 0.8
4 D 0.6 0.7 0.8
4 D 0.6 0.7 0.8
If you are using Python and pandas then it's not too difficult, I guess:
import pandas as pd

d1 = pd.read_csv('doc1.txt', sep=" ", header=None)
d2 = pd.read_csv('doc2.txt', sep=" ", header=None)
data = d1.merge(d2, on=[0], how='left')
print(data)
There will be NaN values in data if the second file does not have corresponding indices; if you don't want that, you can change the type of join.
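If you would rather stay in the shell, a similar left join can be sketched in awk (using the File1/File2 names from the question; a key with no match in File2 is printed as-is instead of producing NaN columns, and the output name merged.txt is arbitrary):
awk 'NR==FNR{line[$1]=$0; next} {print (($1 in line) ? line[$1] : $0)}' File2 File1 > merged.txt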

Comparing two columns and summing the values in Matlab

I have 2 columns like this:
0.0 1.2
0.0 2.3
0.0 1.5
0.1 1.0
0.1 1.2
0.1 1.4
0.1 1.7
0.4 1.1
0.4 1.3
0.4 1.5
In the 1st column, 0.0 is repeated 3 times. I want to sum corresponding elements
(1.2 + 2.3 + 1.5) in the 2nd column. Similarly, 0.1 is repeated 4 times in the 1st
column. I want to sum the corresponding elements (1.0 + 1.2 + 1.4 + 1.7) in the 2nd
column and so on.
I am trying like this:
for i = 1:length(col1)
    for j = 1:length(col2)
        % if col2(j) == col1(i)
        %     % to do
        % end
    end
end
This is a classical use of unique and accumarray:
x = [0.0 1.2
0.0 2.3
0.0 1.5
0.1 1.0
0.1 1.2
0.1 1.4
0.1 1.7
0.4 1.1
0.4 1.3
0.4 1.5]; % data
[~, ~, w] = unique(x(:,1)); % labels of unique elements
result = accumarray(w, x(:,2)); % sum using the above as grouping variable
You can also use the newer splitapply function instead of accumarray:
[~, ~, w] = unique(x(:,1)); % labels of unique elements
result = splitapply(@sum, x(:,2), w); % sum using the above as grouping variable
a=[0.0 1.2
0.0 2.3
0.0 1.5
0.1 1.0
0.1 1.2
0.1 1.4
0.1 1.7
0.4 1.1
0.4 1.3
0.4 1.5]
% Get unique col1 values, and indices
[uniq,~,ib]=unique(a(:,1));
% for each unique value in col1
for ii=1:length(uniq)
    % sum all col2 values that correspond to the current index of the unique value
    s(ii)=sum(a(ib==ii,2));
end
Gives:
s =
5.0000 5.3000 3.9000

Printing lines containing the least number in groups - AWK/SED/PERL

I would like to print only the lines containing the least number in each group. My file contains multiple columns, and I use the first column to determine groups. Let's say the 1st, 4th and 6th lines are in the same group because the content of the first column is the same. My goal is to print out the line that contains the least number in the second column for each group.
file.txt:
VDDA 0.7 ....
VDDB 0.2 ....
VDDB 0.3 ....
VDDA 0.4 ....
VSS 0.1 ....
VDDA 0.2 ....
VSS 0.2 ....
output.txt:
VDDA 0.2 ....
VDDB 0.2 ....
VSS 0.1 ....
I think I can do this job with C using a for loop and comparisons, but I think there is a better way using AWK/SED/PERL.
If you are not bothered about preserving the order of the 1st field as it appears in the Input_file, the following may help you. This code looks for the smallest value in the 2nd column for each 1st-field group and prints it.
awk '{a[$1]=a[$1]>$2?$2:(a[$1]?a[$1]:$2)} END{for(i in a){print i,a[i]}}' Input_file
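The same one-liner spread over several lines with comments, in case the ternaries are hard to read (a sketch; behaviour should be identical):
awk '
    {
        # keep the smaller of the stored value and the current 2nd field;
        # the (a[$1] ? a[$1] : $2) part seeds the array the first time a key is seen
        a[$1] = (a[$1] > $2) ? $2 : (a[$1] ? a[$1] : $2)
    }
    END {
        for (i in a) print i, a[i]   # note: "for (i in a)" does not guarantee input order
    }
' Input_file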
EDIT1: If you want the output in the same order as $1 appears in the input, the following may help:
awk '!a[$1]{b[++i]=$1} {c[$1]=a[$1]>$2?$0:(c[$1]?c[$1]:$0);a[$1]=a[$1]>$2?$2:(a[$1]?a[$1]:$2);} END{for(j=1;j<=i;j++){print b[j],c[b[j]]}}' Input_file
$ awk '{split(a[$1],s);a[$1]=(s[2]<$2 && s[2])?a[$1]:$0} END{for(i in a)print a[i]}' file.txt
VDDA 0.2 ....
VDDB 0.2 ....
VSS 0.1 ....
Brief explanation (a commented version follows after these steps):
Save $0 into a[$1]
split(a[$1],s): split out the numeric s[2] from a[$1] for comparison
If the condition s[2]<$2 && s[2] is met, keep a[$1] as it is; otherwise set a[$1]=$0
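For readability, here is the same one-liner laid out with those steps as comments (a sketch; it should behave the same as the compact form above):
awk '
    {
        split(a[$1], s)                            # s[2] = 2nd field of the line stored for this key
        a[$1] = (s[2] < $2 && s[2]) ? a[$1] : $0   # keep the stored line only if its 2nd field is smaller
    }
    END {
        for (i in a) print a[i]
    }
' file.txt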
With the GNU datamash tool:
Assuming the following exemplary input file containing rows with 5 columns:
VDDA 0.7 c1 2 a
VDDB 0.2 c2 3 b
VDDB 0.3 c4 5 c
VDDA 0.4 c5 6 d
VSS 0.1 c6 7 e
VDDA 0.2 c7 8 f
VSS 0.2 c8 9 g
datamash -sWf -g1 min 2 < file | awk '{--NF}1'
The output:
VDDA 0.2 c7 8 f
VDDB 0.2 c2 3 b
VSS 0.1 c6 7 e

How to collect the indices of an array X that has the same lengths between elements

I am trying to make an array Z that has the indexes of the most frequently occurring difference between two elements in the array X. So if the most frequently occurring difference between two elements in X is 3, then I would collect all the indexes in X that have that difference into the array Z.
x = [ 0.2 0.4 0.6 0.4 0.1 0.2 0.2 0.3 0.4 0.3 0.6];
ct = 0;
difference_x = diff(x);
unique_x = unique(difference_x);
for i = 1:length(unique_x)
    for j = 1:length(x)
        space_between_elements = abs(x(i)-x(i+1));
        if space_between_elements == difference_x
            ct = ct + 1;
            space_set(i,ct) = j;
        end
    end
end
I don't get the indexes of X containing the most frequent difference from this code.
It appears you want to find how many unique differences there are, with "difference" interpreted in an absolute-value sense, and also find how many times each difference occurs.
You can do that as follows:
x = [ 0.2 0.4 0.6 0.4 0.1 0.2 0.2 0.3 0.4 0.3 0.6]; %// data
difference_x = abs(diff(x));
unique_x = unique(difference_x); %// all differences
counts = histc(difference_x, unique_x); %// count for each difference
However, comparing reals for uniqueness (or equality) is problematic because of finite precision. You should rather apply a tolerance to declare two values as "equal":
x = [ 0.2 0.4 0.6 0.4 0.1 0.2 0.2 0.3 0.4 0.3 0.6]; %// data
tol = 1e-6; %// tolerance
difference_x = abs(diff(x));
difference_x = round(difference_x/tol)*tol; %// apply tolerance
unique_x = unique(difference_x); %// all differences
counts = histc(difference_x, unique_x); %// count for each difference
With your example x, the second approach gives
>> unique_x
unique_x =
0 0.1000 0.2000 0.3000
>> counts
counts =
1 4 3 2
