Substituting some elements of a matrix with new values in bash

I am trying to read in a file, change some parts of it, and write it to a new file in bash. I know that I can substitute the parts with the "sed" command, but I do not know how to do it in my case, where the part to change is a matrix at the beginning of my file. Here is what my file looks like:
alpha
1 0 0
0 1 0
0 0 1
some text here
more numbers here
I am trying to substitute the values of the matrix above in a for loop like this:
for i in 1 2 3 4
do
replace 1 0 0
0 1 0
0 0 1
by 1*(1+${i}) ${i}/2 0
${i}/2 1 0
0 0 1
and print the whole file with the substitution to newfile.${i}
done
I want to do this in bash. Any idea how to do this? And I only want to change this part and only this part!

Awk is more suitable for this:
for i in {1..4}; do awk -v i="$i" -f substitute.awk oldfile.txt > newfile.$i; done
using the following substitute.awk script:
{
if( NR == 3 ) { print 1 + i, i / 2, 0 }
else if( NR == 4 ) { print i / 2, 1, 0 }
else print $0
}
(This assumes, as you wrote, that the matrix always occupies lines 3 through 5. In your example it occupies lines 2 through 4, so there you would test NR == 2 and NR == 3 instead.)
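For comparison, the same line-number-based substitution can be sketched in Python. This is only an illustration under stated assumptions: the function name substitute_matrix and the first_row parameter are hypothetical, and first_row defaults to 2 because that is where the matrix starts in the sample file.

```python
def substitute_matrix(lines, i, first_row=2):
    """Return a copy of lines with the first two matrix rows replaced.

    first_row is the 1-based line number of the matrix's first row
    (2 for the sample file in the question); this is an assumption.
    """
    out = list(lines)
    # ':g' drops trailing zeros, so 2.0 is written as 2 and 0.5 stays 0.5.
    out[first_row - 1] = f"{1 + i:g} {i / 2:g} 0"
    out[first_row] = f"{i / 2:g} 1 0"
    return out


sample = ["alpha", "1 0 0", "0 1 0", "0 0 1", "some text here", "more numbers here"]
for i in range(1, 5):
    result = substitute_matrix(sample, i)
    # A real script would write "\n".join(result) to newfile.<i> here.
    print(i, result[1:4])
```

Only the two rows that change are touched; the third matrix row and the rest of the file pass through unchanged, as in the awk script.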

Related

How to use offset in arrays in bash?

Here is my code.
#! /bin/bash
array=(3 2 1 0 0 0 0 0 0 0)
for i in {0..10}
do
this=${array:$i:$((i+1))}
echo $this
done
I want to print each number of my array separately. I have used this line to get the array elements using an offset number.
this=${array:$i:$((i+1))}
However, only 3 is printed and the rest are all empty lines. I basically want to print 3, 2, 1 etc. on separate lines. How do I correct this?
First, you need to use the whole array, array[@], not array.
echo "${array[@]:3:2}"
Then, you may change the index to simple variable names:
this=${array[@]:i:i+1}
And then, you probably need to extract just one value of the list:
this=${array[@]:i:1}
Try this code:
array=(3 2 1 0 0 0 0 0 0 0)
# the array has 10 elements, so the valid indices are 0 through 9
for i in {0..9}
do
this=${array[@]:i:1}
echo "$this"
done
There is no reason to use an array slice here; just access the individual elements of the array. Try this:
#! /bin/bash
array=(3 2 1 0 0 0 0 0 0 0)
# the array has 10 elements, so the valid indices are 0 through 9
for i in {0..9}
do
this=${array[i]}
echo "$this"
done
In general you can access a single element of an array like that: ${array[3]}.
Note that in this case, it would have been preferable to do this:
array=(3 2 1 0 0 0 0 0 0 0)
for this in "${array[@]}"
do
echo "$this"
done

Filter column from file based on header matching a regex

I have the following file
foo_foo bar_blop baz_N toto_N lorem_blop
1 1 0 0 1
1 1 0 0 1
I'd like to remove the columns whose header has the _N tag (or, equivalently, select all the others).
So the output should be
foo_foo bar_blop lorem_blop
1 1 1
1 1 1
I found some answers, but none were doing exactly this.
I know awk can do this, but I'm not good at awk and don't understand how to do it myself.
Thanks for the help :)
awk 'NR==1{for(i=1;i<=NF;i++)if(!($i~/_N$/)){a[i]=1;m=i}}
{for(i=1;i<=NF;i++)if(a[i])printf "%s%s",$i,(i==m?RS:FS)}' f|column -t
outputs:
foo_foo bar_blop lorem_blop
1 1 1
1 1 1
$ cat tst.awk
NR==1 {
for (i=1;i<=NF;i++) {
if ( (tgt == "") || ($i !~ tgt) ) {
f[++nf] = i
}
}
}
{
for (i=1; i<=nf; i++) {
printf "%s%s", $(f[i]), (i<nf?OFS:ORS)
}
}
$ awk -v tgt="_N" -f tst.awk file | column -t
foo_foo bar_blop lorem_blop
1 1 1
1 1 1
$ awk -f tst.awk file | column -t
foo_foo bar_blop baz_N toto_N lorem_blop
1 1 0 0 1
1 1 0 0 1
$ awk -v tgt="blop" -f tst.awk file | column -t
foo_foo baz_N toto_N
1 0 0
1 0 0
The main difference between this and @Kent's solution is performance, and the impact will vary based on the percentage of fields you want to print on each line.
The above, when reading the first line of the file, creates an array of the field numbers to print, and then for every line of the input file it just prints those fields in a loop. So if you wanted to print 3 out of 100 fields, this script would loop through just 3 iterations/fields on each input line.
@Kent's solution also creates an array of the field numbers to print, but then for every line of the input file it visits every field to test whether it's in that array before deciding to print it. So if you wanted to print 3 out of 100 fields, @Kent's script would loop through all 100 iterations/fields on each input line.
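The "decide once from the header, then only touch the kept fields" idea can be sketched in Python as follows. This is an illustration rather than a translation of either awk script; the function name filter_columns is hypothetical, and the rows are assumed to be whitespace-separated with the same number of fields as the header.

```python
import re


def filter_columns(lines, pattern):
    """Drop every column whose header matches the regular expression pattern."""
    header = lines[0].split()
    # Decide once, from the header, which column positions to keep.
    keep = [idx for idx, name in enumerate(header) if not re.search(pattern, name)]
    out = []
    for row in lines:
        fields = row.split()
        # Each row only visits the kept positions, not every field.
        out.append(" ".join(fields[idx] for idx in keep))
    return out


table = [
    "foo_foo bar_blop baz_N toto_N lorem_blop",
    "1 1 0 0 1",
    "1 1 0 0 1",
]
for line in filter_columns(table, r"_N$"):
    print(line)
```

With 3 kept fields out of 100, the per-row loop runs 3 times, which is the performance point made above.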

perl array size is smaller than it should be

I want to initialize 4^9 (=262144) indices of @clump as 0. So I wrote this:
my $k=9;
my @clump=();
my $n=4**$k;
for(my $i=0;$i<$n;$i++){
push(@clump,0);
print "$i ";
}
But it keeps freezing at 261632! I then tried making $n=5^9 (=1953125) and my code stopped at 1952392. So it's definitely not a memory issue. This should be simple enough, but I can't figure out what's wrong with my code. Help a newbie?
Suffering from buffering?
When I add a sleep 1000 to the end of your program, stream the output to a file, and read the tail of the file, I also observe the last numbers to be printed are 261632 and 1952392. The remaining output is stuck in the output buffer, waiting for some event (the buffer filling up, the filehandle closing, the program exiting, or an explicit flush call) to flush the output.
The buffering can be changed by putting one of the following statements early in your program:
$|= 1;
STDOUT->autoflush(1); # on Perl older than 5.14, add "use IO::Handle;" first
#!/usr/bin/env perl
use strict;
use warnings;
my $k = 9;
my $n = 4 ** $k;
my @clump = (0) x $n;
print join(' ', @clump), "\n";
printf "%d elements in \@clump\n", scalar @clump;
Or,
#!/usr/bin/env perl
use strict;
use warnings;
my $k = 9;
my $n = 4 ** $k;
my @clump;
$#clump = $n - 1;
$_ = 0 for @clump;
print join(' ', @clump), "\n";
printf "%d elements in \@clump\n", scalar @clump;
Output:
...
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
262144 elements in @clump
Also, note that initialization with 0 is almost never required in Perl. Why do you need this?

MATLAB removing rows which has duplicates in sequence

I'm trying to remove the rows that contain duplicates in sequence. There are only two possible values, 0 and 1. I have an n-by-m matrix, where n is the number of bits and m is not important for my question. My goal is to obtain an n-by-(m-a) matrix, where a is the number of rows that contain duplicates in sequence. For example:
My matrix is :
A=[0 1 0 1 0 1;
0 0 0 1 1 1;
0 0 1 0 0 1;
0 1 0 0 1 0;
1 0 0 0 1 0]
I want to remove the rows that contain t consecutive 0s. In this question, let's assume t is 3. So I want the matrix:
B=[0 1 0 1 0 1;
0 0 1 0 0 1;
0 1 0 0 1 0]
2nd and 5th rows are removed.
I probably need to use diff.
So you want to remove rows of A that contain at least t zeros in sequence.
How about a single line?
B = A(~any(conv2(1,ones(1,t),2*A-1,'valid')==-t, 2),:);
How this works:
Transform A to bipolar form (2*A-1)
Convolve each row with a sequence of t ones (conv2(...))
Keep only rows for which the convolution does not contain -t (~any(...)). The presence of -t indicates a sequence of t zeros in the corresponding row of A.
To remove rows that contain at least t ones, just change -t to t:
B = A(~any(conv2(1,ones(1,t),2*A-1,'valid')==t, 2),:);
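The same bipolar-transform and sliding-window-sum idea can be sketched in plain Python, without conv2. The function names has_run and remove_rows_with_run are hypothetical, and the sketch assumes the matrix is a list of rows containing only 0 and 1.

```python
def has_run(row, value, t):
    """True if row (entries 0 or 1) contains t consecutive copies of value."""
    bipolar = [2 * x - 1 for x in row]  # 0 -> -1, 1 -> +1
    target = t if value == 1 else -t    # window sum that signals a run
    # Sliding sum over every window of length t (the 'valid' part of the convolution).
    sums = (sum(bipolar[j:j + t]) for j in range(len(row) - t + 1))
    return any(s == target for s in sums)


def remove_rows_with_run(matrix, value, t):
    """Keep only the rows without t consecutive copies of value."""
    return [row for row in matrix if not has_run(row, value, t)]


A = [
    [0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
]
B = remove_rows_with_run(A, 0, 3)
for row in B:
    print(row)
```

A window of length t sums to -t only if all t entries are -1, that is, only if the original row has t zeros in a row at that position.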
Here is a generalized approach that removes any row containing a given number of consecutive duplicates (not just zeros; it works for any value).
t = 3;
row_mask = ~any(all(~diff(reshape(im2col(A,[1 t],'sliding'),t,size(A,1),[]))),3);
out = A(row_mask,:)
Sample Run:
>> A
A =
0 1 0 1 0 1
0 0 1 5 5 5 %// consecutive 3 5's
0 0 1 0 0 1
0 1 0 0 1 0
1 1 1 0 0 1 %// consecutive 3 1's
>> out
out =
0 1 0 1 0 1
0 0 1 0 0 1
0 1 0 0 1 0
How about an approach using strings? This is certainly not as fast as Luis Mendo's method where you work directly with the numerical array, but it's thinking a bit outside of the box. The basis of this approach is that I consider each row of A to be a unique string, and I can search each string for occurrences of a string of 0s by regular expressions.
A=[0 1 0 1 0 1;
0 0 0 1 1 1;
0 0 1 0 0 1;
0 1 0 0 1 0;
1 0 0 0 1 0];
t = 3;
B = sprintfc('%s', char('0' + A));
ind = cellfun('isempty', regexp(B, repmat('0', [1 t])));
B(~ind) = [];
B = double(char(B) - '0');
We get:
B =
0 1 0 1 0 1
0 0 1 0 0 1
0 1 0 0 1 0
Explanation
Line 1: Convert each line of the matrix A into a string consisting of 0s and 1s. Each line becomes a cell in a cell array. This uses the undocumented function sprintfc to facilitate this cell array conversion.
Line 2: I use regular expressions to find any occurrences of a string of 0s that is t long. I first use repmat to create a search string that is full of 0s and is t long. After, I determine if each line in this cell array contains this sequence of characters (i.e. 000....). The function regexp helps us perform regular expressions and returns the locations of any matches for each cell in the cell array. Alternatively, you can use the function strfind for more recent versions of MATLAB to speed up the computation, but I chose regexp so that the solution is compatible with most MATLAB distributions out there.
Continuing on, the output of regexp/strfind is a cell array in which each cell reports the locations where the search string was found. If there is a match, at least one location is reported, so I check which cells are empty: those are the rows we don't want to remove. To turn this into a logical array for removing rows, the call is wrapped in cellfun to determine which cells are empty. This line therefore returns a logical array where 0 means remove this row and 1 means keep it.
Line 3: I take the logical array from Line 2 and invert it because that's what we really want. We use this inverted array to index into the cell array and remove those strings.
Line 4: The output is still a cell array, so I convert it back into a character array, and finally back into a numerical array.
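The string-based approach carries over to Python almost directly. The sketch below is an illustration, not a port of the MATLAB code: the function name remove_rows_by_string is hypothetical, a plain substring test stands in for regexp, and every matrix entry is assumed to be a single digit so that one character corresponds to one element.

```python
def remove_rows_by_string(matrix, t):
    """Drop rows whose digit string contains t consecutive '0' characters."""
    pattern = "0" * t  # search string, e.g. "000" for t = 3
    kept = []
    for row in matrix:
        text = "".join(str(x) for x in row)  # e.g. [0, 1, 0] -> "010"
        if pattern not in text:
            kept.append(row)
    return kept


A = [
    [0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
]
B = remove_rows_by_string(A, 3)
for row in B:
    print(row)
```

Because the rows are filtered directly, there is no need to convert the strings back to numbers at the end.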

A question about matrix manipulation

Given a 1*N matrix or an array, how do I find the first 4 elements which have the same value and then store the index for those elements?
PS:
I'm just curious. What if we want to find the first 4 elements whose value differences are within a certain range, say below 2? For example, M=[10,15,14.5,9,15.1,8.5,15.5,9.5], the elements I'm looking for will be 15,14.5,15.1,15.5 and the indices will be 2,3,5,7.
If you want the first value present 4 times in the array 'tab' in Matlab, you can use
num_min = 4
val=NaN;
for i = tab
if sum(tab==i) >= num_min
val = i;
break
end
end
ind = find(tab==val, num_min);
For instance, with
tab = [2 4 4 5 4 6 4 5 5 4 6 9 5 5]
you get
val =
4
ind =
2 3 5 7
Here is my MATLAB solution:
array = randi(5, [1 10]); %# random array of integers
n = unique(array)'; %'# unique elements
[r,~] = find(cumsum(bsxfun(@eq,array,n),2) == 4, 1, 'first');
if isempty(r)
val = []; ind = []; %# no answer
else
val = n(r); %# the value found
ind = find(array == val, 4); %# indices of elements corresponding to val
end
Example:
array =
1 5 3 3 1 5 4 2 3 3
val =
3
ind =
3 4 9 10
Explanation:
First of all, we extract the list of unique elements. In the example used above, we have:
n =
1
2
3
4
5
Then using the BSXFUN function, we compare each unique value against the entire vector array we have. This is equivalent to the following:
result = zeros(length(n),length(array));
for i=1:length(n)
result(i,:) = (array == n(i)); %# row-by-row
end
Continuing with the same example we get:
result =
1 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 1 1 0 0 0 0 1 1
0 0 0 0 0 0 1 0 0 0
0 1 0 0 0 1 0 0 0 0
Next we call CUMSUM on the result matrix to compute the cumulative sum along the rows. Each row will give us how many times the element in question appeared so far:
>> cumsum(result,2)
ans =
1 1 1 1 2 2 2 2 2 2
0 0 0 0 0 0 0 1 1 1
0 0 1 2 2 2 2 2 3 4
0 0 0 0 0 0 1 1 1 1
0 1 1 1 1 2 2 2 2 2
Then we compare that against four with cumsum(result,2)==4 (since we want the location where an element appeared for the fourth time):
>> cumsum(result,2)==4
ans =
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Finally we call FIND to look for the first appearing 1 according to a column-wise order: if we traverse the matrix from the previous step column-by-column, then the row of the first appearing 1 indicates the index of the element we are looking for. In this case, it was the third row (r=3), thus the third element in the unique vector is the answer val = n(r). Note that if we had multiple elements repeated 4 times or more in the original array, then the one first appearing for the fourth time will show up first as a 1 going column-by-column in the above expression.
Finding the indices of the corresponding answer value is a simple call to FIND...
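The cumulative-count idea above amounts to scanning the array once and stopping at the first value whose running count reaches four. A minimal Python sketch of that reading follows; the function name first_value_seen_k_times is hypothetical, and the returned indices are 0-based, unlike MATLAB's 1-based ones.

```python
from collections import defaultdict


def first_value_seen_k_times(values, k=4):
    """Return (value, indices) for the first value to reach k occurrences.

    Indices are 0-based. Returns (None, []) if no value occurs k times.
    """
    positions = defaultdict(list)
    for idx, v in enumerate(values):
        positions[v].append(idx)  # running record of where v has appeared
        if len(positions[v]) == k:
            return v, positions[v]
    return None, []


array = [1, 5, 3, 3, 1, 5, 4, 2, 3, 3]
val, ind = first_value_seen_k_times(array)
print(val, ind)
```

For the example array this finds the value 3 at 0-based positions 2, 3, 8 and 9, which correspond to the 1-based indices 3, 4, 9 and 10 shown above.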
Here is C++ code
std::map<int,std::vector<int> > dict;
std::vector<int> ans(4);//here we will store indexes
bool noanswer=true;
//my_vector is the vector we must analyze
for(int i=0;i<my_vector.size();++i)
{
std::vector<int> &temp = dict[my_vector[i]];
temp.push_back(i);
if(temp.size()==4)//we found the answer
{
std::copy(temp.begin(),temp.end(),ans.begin() );
noanswer = false;
break;
}
}
if(noanswer)
std::cout<<"No Answer!"<<std::endl;
Ignore this and use Amro's mighty solution . . .
Here is how I'd do it in Matlab. The matrix can be any size and contain any range of values and this should work. This solution will automatically find a value and then the indices of the first 4 elements without being fed the search value a priori.
tab = [2 5 4 5 4 6 4 5 5 4 6 9 5 5]
%this is a loop to find the indices of groups of 4 identical elements
tot = zeros(size(tab));
for nn = 1:numel(tab)
idxs=find(tab == tab(nn), 4, 'first');
if numel(idxs)<4
tot(nn) = Inf;
else
tot(nn) = sum(idxs);
end
end
%find the first 4 identical
bestTot = find(tot == min(tot), 1, 'first' );
%store the indices you are interested in.
indiciesOfInterst = find(tab == tab(bestTot), 4, 'first')
Since I couldn't easily understand some of the solutions, I made that one:
l = 10; m = 5; array = randi(m, [1 l])
A = zeros(l,m); % m is the maximum value (may) in array
A(sub2ind([l,m],1:l,array)) = 1;
s = sum(A,1);
b = find(s(array) == 4,1);
% now in b is the index of the first element
if (~isempty(b))
find(array == array(b))
else
disp('nothing found');
end
I find this easier to visualize. It puts a 1 in every position of a matrix A where a value of array exists, according to its position (row) and value (column). The columns are then summed up easily and mapped back to the original array. Drawback: if array contains very large values, A may get relatively large too.
Your PS question is more complicated. I didn't have time to check each case, but the idea is here:
M=[10,15,14.5,9,15.1,8.5,15.5,9.5]
val = NaN;
num_min = 4;
delta = 2;
[Ms, iMs] = sort(M);
dMs = diff(Ms);
ind_min=Inf;
n = 0;
for i = 1:length(dMs)
if dMs(i) <= delta
n=n+1;
else
n=0;
end
if n == (num_min-1)
if (iMs(i) < ind_min)
ind_min = iMs(i);
end
end
end
ind = sort(iMs(ind_min + (0:num_min-1)))
val = M(ind)
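As a cross-check for the PS question, here is a brute-force Python sketch under two explicit assumptions that the question leaves open: "within a certain range" is read as max minus min being strictly below delta, and "first" is read as the group whose last member appears earliest in the array. The function name first_close_group is hypothetical, and this is not a translation of the MATLAB code above, which tests consecutive differences of the sorted values instead.

```python
def first_close_group(values, k=4, delta=2):
    """Return 0-based indices of the first k values lying within delta of each other.

    Assumptions: 'within delta' means max - min < delta, and 'first' means
    the group that is completed earliest when scanning left to right.
    Returns [] if no such group exists.
    """
    for end in range(k - 1, len(values)):
        # Sort the prefix by value, remembering each original position.
        prefix = sorted((v, idx) for idx, v in enumerate(values[:end + 1]))
        # Any k values within delta of each other show up as k neighbours here.
        for start in range(len(prefix) - k + 1):
            window = prefix[start:start + k]
            if window[-1][0] - window[0][0] < delta:
                return sorted(idx for _, idx in window)
    return []


M = [10, 15, 14.5, 9, 15.1, 8.5, 15.5, 9.5]
ind = first_close_group(M)
print(ind, [M[i] for i in ind])
```

For M this yields the 0-based indices 1, 2, 4 and 6, that is, the 1-based indices 2, 3, 5 and 7 from the question. Re-sorting every prefix is quadratic or worse, so this is meant for checking small cases, not for long arrays.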

Resources