Fetching indices of a text file from another text file

The title may not be very descriptive, so let me explain.
I have a file (say File 1) containing numbers delimited by spaces:
1 2 3 4 5
1 2 8 4 5 6 7
1 9 3 4 5 6 7 8
..... n lines (length of each line varies).
I have another file (say File 2) containing numbers delimited by tabs:
1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1
..... m lines (length of each line fixed).
I want the sum of the values at positions 1 2 3 4 5 (File 1, line 1) of File 2, line 1.
I want the sum of the values at positions 1 2 8 4 5 6 7 (File 1, line 2) of File 2, line 1, and so on.
In other words, for each line of File 2, I want the sum of the values at the positions listed on each line of File 1.
The output will look like:
5 6 6 …n columns (File 1)
1 8 3
9 8 4
… m rows (File 2)
I did this with the following code:
open( FH1, "File1.txt" );
@index = <FH1>;
open( FH2, "File2.txt" );
@matrix = <FH2>;
open( OUTPUT, ">sum.txt" );
foreach $xx (@matrix) {
    @k1 = split( /\t/, "$xx" );
    foreach $yy (@index) {
        @k2 = split( / /, "$yy" );
        $ssum = 0;
        foreach $zz (@k2) {
            $zz1 = $zz - 1;
            if ( $k1[$zz1] == 1 ) {
                $ssum++;
            }
        }
        printf OUTPUT"$ssum\t";
        $ssum = 0;
    }
    print OUTPUT"\n";
}
close FH1;
close FH2;
close OUTPUT;
It works absolutely fine, except that the time requirement is enormous for large files (e.g. 1000 lines in File 1 x 25000 lines in File 2 takes 8 minutes).
My data may be more than 4 times the size of this example, and that is unacceptable for my users.
How can I accomplish this in much less time, or by some other approach?

Always include use strict; and use warnings; in every Perl script.
You can simplify your script by not processing the first file multiple times. Also, your coding style is very outdated. You could use some lessons from the Modern Perl book by chromatic.
The following is your script simplified to take advantage of more modern style and techniques. Note that it currently loads the file data from inside the script instead of from external sources:
use strict;
use warnings;
use autodie;
use List::Util qw(sum);
my @indexes = do {
    #open my $fh, '<', "File1.txt";
    open my $fh, '<', \ "1 2 3 4 5\n1 2 8 4 5 6 7\n1 9 3 4 5 6 7 8\n";
    map { [map {$_ - 1} split ' '] } <$fh>
};
#open my $infh, '<', "File2.txt";
my $infh = \*DATA;
#open my $outfh, '>', "sum.txt";
my $outfh = \*STDOUT;
while (<$infh>) {
    my @vals = split ' ';
    print $outfh join(' ', map {sum(@vals[@$_])} @indexes), "\n";
}
__DATA__
1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 1 0 1 1 1 1 1
Outputs:
5 6 7
5 7 8
5 6 7
5 6 7
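The same idea (parse the index file once, then reuse the zero-based index lists for every row) can be sketched in Python. This is only an illustration, not the code from the answer: the function names parse_indexes and row_sums are hypothetical, and the sample data is inlined rather than read from File1.txt and File2.txt.

```python
def parse_indexes(lines):
    """Turn each line of 1-based positions into a list of 0-based indexes."""
    return [[int(tok) - 1 for tok in line.split()] for line in lines if line.strip()]


def row_sums(matrix_lines, indexes):
    """For each matrix row, sum the values at each precomputed index list."""
    result = []
    for line in matrix_lines:
        if not line.strip():
            continue  # skip blank lines
        vals = [int(tok) for tok in line.split()]  # split() handles tabs and spaces
        result.append([sum(vals[i] for i in idx) for idx in indexes])
    return result


# Sample data from the question (assumed here to be inlined, not read from disk).
file1 = ["1 2 3 4 5", "1 2 8 4 5 6 7", "1 9 3 4 5 6 7 8"]
file2 = [
    "1\t1\t1\t1\t1\t1\t0\t1\t1\t1\t1\t1",
    "1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1",
    "1\t1\t1\t1\t1\t1\t0\t1\t1\t1\t1\t1",
    "1\t1\t1\t1\t1\t1\t0\t1\t1\t1\t1\t1",
]

sums = row_sums(file2, parse_indexes(file1))
for row in sums:
    print(" ".join(map(str, row)))
```

The index lists are built once, outside the loop over the matrix rows, which is the main saving over re-splitting File 1 for every line of File 2.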

Related

Nested for-loop: error variable already defined

I have a nested loop in Stata with four levels of foreach statements. With this loop, I am trying to create a new variable named strata that ranges from 1 to 40.
foreach x in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 {
foreach r in 1 2 3 4 5 {
foreach s in 1 2 {
foreach a in 1 2 3 4 {
gen strata= `x' if race==`r' & sex==`s' & age==`a'
}
}
}
}
I get an error :
"variable strata already defined"
Even with the error, the loop does assign strata = 1, but not the rest of the strata. All other cells are missing/empty.
Example data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(age sex race)
1 2 2
1 2 1
1 1 1
1 1 1
1 2 1
2 2 1
2 2 1
4 2 1
1 2 1
4 2 1
3 2 1
2 2 1
4 2 1
4 2 2
3 2 1
4 1 3
4 2 1
4 2 1
2 1 2
4 2 1
2 2 1
3 2 1
3 2 1
1 2 3
4 2 1
1 2 5
4 2 1
4 2 1
4 2 2
4 2 1
2 2 1
4 1 1
3 2 1
1 2 1
2 2 1
4 2 1
1 2 2
2 2 3
1 1 3
4 2 1
2 2 3
1 2 1
1 1 1
2 2 3
1 2 1
1 1 3
1 2 1
2 2 1
3 2 1
1 2 1
4 2 1
1 2 2
1 2 1
2 2 1
4 2 1
4 2 1
1 2 1
1 2 1
4 2 1
2 2 1
4 2 1
1 2 1
1 1 3
2 2 1
1 1 1
4 1 1
3 2 1
2 2 1
1 2 1
1 1 1
2 2 3
4 2 2
2 2 1
2 2 1
3 2 1
2 2 2
3 2 1
2 1 1
1 1 1
3 2 1
1 2 3
4 2 1
4 2 1
2 2 1
1 2 1
1 1 1
3 2 1
4 2 1
2 2 3
1 2 3
4 2 1
3 2 1
2 2 1
4 2 1
3 2 1
2 1 1
1 2 1
2 2 1
2 2 3
1 1 1
end
label values sex sex
label def sex 1 "male (1)", modify
label def sex 2 "female (2)", modify
label values race race
label def race 1 "non-Hispanic white (1)", modify
label def race 2 "black (2)", modify
label def race 3 "AAPI/other (3)", modify
label def race 5 "Hispanic (5)", modify

generate is for generating new variables. The second time your code reaches a generate statement, the code fails for the reason given.
One answer is that you need to generate your variable outside the loops and then replace inside.
For other reasons your code can be rewritten in stages.
First, integer sequences can be more easily and efficiently specified with forvalues, which can be abbreviated: I tend to write forval.
gen strata = .
forval x = 1/40 {
forval r = 1/5 {
forval s = 1/2 {
forval a = 1/4 {
replace strata = `x' if race==`r' & sex==`s' & age==`a'
}
}
}
}
Second, the code is flawed anyway: every pass of the outer loop over x overwrites all race, sex and age combinations, so everything ends up as 40!
Third, you can do allocations much more directly, say by
gen strata = 8 * (race - 1) + 4 * (sex - 1) + age
This is a self-contained reproducible demonstration:
clear
set obs 5
gen race = _n
expand 2
bysort race : gen sex = _n
expand 4
bysort race sex : gen age = _n
gen strata = 8 * (race - 1) + 4 * (sex - 1) + age
isid strata
Clearly you can and should vary the recipe for a different preferred scheme.
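As a quick sanity check of the arithmetic, the direct formula can be evaluated over every combination of race (1 to 5), sex (1 to 2) and age (1 to 4) to confirm that it yields each code from 1 to 40 exactly once. The sketch below does this in Python rather than Stata; the function name strata_code is hypothetical, and the value ranges are assumed from the loops in the question.

```python
from itertools import product


def strata_code(race, sex, age):
    """Same arithmetic as: gen strata = 8 * (race - 1) + 4 * (sex - 1) + age"""
    return 8 * (race - 1) + 4 * (sex - 1) + age


# Enumerate all race x sex x age cells (ranges assumed from the question's loops).
codes = [
    strata_code(race, sex, age)
    for race, sex, age in product(range(1, 6), range(1, 3), range(1, 5))
]

# Every cell should get a distinct code covering 1..40.
print(sorted(codes) == list(range(1, 41)))
```

Note that race is coded 1, 2, 3 and 5 in the example data, so with that coding the formula would skip the block of codes belonging to race 4.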

How do we make arrays in awk?

{
k = 0
x = 0
fracon = (10/2)+1
{
for (j = 1; j <= 1100 ; j++)
{
if (j <= fracon)
scal[j]= j-x
else
k= k + 1
scal[j]= j - (2*k)
{
if (scal[j] == 1)
fracon= fracon+11
{
if (j % 11 == 0)
x=x+11
k=k+0.5
}
}
}
}
}
That's all. I used the above code to generate the following array. It works in Matlab, but it does not work in awk.
array= [1 2 3 4 5 6 5 4 3 2 1 1 2 3 4 5 6]

Here is another way of generating the same sequence:
$ awk 'BEGIN{for(i=0;i<=20;i++) {k=i%11+1; printf "%s ", (k<7?k:12-k)}; print ""}'
1 2 3 4 5 6 5 4 3 2 1 1 2 3 4 5 6 5 4 3 2
I'm not sure whether what you want simply repeats on an 11-element cycle; it's difficult to say based on the limited sample.
Or without awk:
$ yes $({ seq 6; seq 5 -1 1; } | paste -sd' ') | head -100 | paste -sd' '
1 2 3 4 5 6 5 4 3 2 1 1 2 3 4 5 6 5 4 3 2 1 ...
With square brackets:
$ awk 'BEGIN{printf "[";
for(i=0;i<=1100;i++) {k=i%11+1; printf "%s ", (k<7?k:12-k)};
printf "]\n"}'
[1 2 3 4 5 6 5 4 3 2 1 1 2 3 4 5 6 ... 5 4 3 2 1 ]
Stuffing these values into a large array is not optimal; you can easily write a function that returns the indexed value:
$ awk 'function k(i,_i) {_i=i%11+1; return _i<7?_i:12-_i}
BEGIN{for(i=0;i<=25;i++) print k(i)}'
In the real code, you'll use k(i) instead of printing. Note that the index starts from 0.
N.B. _i is a local variable in the awk function; you don't need to pass it in the call.
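For comparison, the same index-to-value function can be sketched in Python. It assumes, as the awk version does, that the sequence is a triangle wave repeating every 11 elements; the name tri and the peak parameter are hypothetical.

```python
def tri(i, peak=6):
    """Value at 0-based index i of the repeating 1..peak..1 sequence.

    Mirrors the awk function: k = i % 11 + 1; k < 7 ? k : 12 - k
    (with peak = 6, the period is 2 * peak - 1 = 11).
    """
    period = 2 * peak - 1
    k = i % period + 1
    return k if k <= peak else 2 * peak - k


# First 17 values, matching the array shown in the question.
print([tri(i) for i in range(17)])
```

Computing the value on demand avoids storing all 1100 elements, though a list comprehension over range(1100) would produce the full array if one is really needed.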

How to identify a periodic sequence in an array of integers

The main goal is to find a periodic sequence in an array with bash, for example:
{ 2, 5, 7, 8, 2, 6, 5, 3, 5, 4, 2, 5, 7, 8, 2, 6, 5, 3, 5, 4, 2, 5, 7, 8, 2, 6, 5, 3, 5, 4 }
or { 2, 5, 6, 3, 4, 2, 5, 6, 3, 4, 2, 5, 6, 3, 4, 2, 5, 6, 3, 4 }
which must return, as the identified sequences for the two examples,
{ 2, 5, 7, 8, 2, 6, 5, 3, 5, 4 } and { 2, 5, 6, 3, 4 }
I tried with a list and a sub-list made of two arrays, but with no success.
I must be missing something in my loops. I thought of the "tortoise and hare" algorithm as an alternative, but I lack some knowledge of bash commands to implement it.
I prefer to post my second try, with tortoise and hare, as the first seems to be a useless attempt:
#!/bin/bash
declare -A array=( 1, 2, 3, 1, 2, 3, 1, 2, 3 )
declare -A found=()
loop="notfound"
tortoise=`echo ${array[0]}`
hare=`echo ${array[0]}`
found[0]=`echo ${array[0]}`
while ( $loop == "notfound" )
do
for ((i=1;i=`echo ${#array[@]}`;i++))
do
if (( `echo ${array[$#]}` == $hare ))
then
echo "no loop found"
exit 0
fi
hare=`echo ${array[$i]}`
if (( `echo ${array[$#]}` == $hare ))
then
echo "no loop found"
exit 0
fi
hare=`echo ${array[$(($i+1))]}`
tortoise=`echo ${array[$i]}`
found[$i]=`echo ${array[$i]}`
if (( $hare == $tortoise ))
then
loop="found"
printf "$found[@]}"
fi
done
done
I got errors about the associative array needing an index.

Given an array a of single decimal digits
a=(2 5 7 8 2 6 5 3 5 4 2 5 7 8 2 6 5 3 5 4 2 5 7 8 2 6 5 3 5 4)
then using regular expression backsubstitution, for example in perl
printf '%d' "${a[@]}" | perl -lne 'print $1 if /^(\d+)\1+/'
2578265354
Testing with an incomplete sequence
a=(1 2 3 1 2 3 1 2)
printf '%d' "${a[@]}" | perl -lne 'print $1 if /^(\d+)\1+/'
123
If you only want complete repeats, add a $ line anchor to the RE, /^(\d+)\1+$/
Now, if you want to identify the longest subsequence that is "most nearly" repeated, that's a little trickier. For example, in the case of your 250-digit sequence, there is a 117-digit subsequence, repeated 2 times (with 16 characters left over), whereas your expected output is a 13-digit subsequence (repeated 19 times, with 3 digits left over). So you want an algorithm that is "greedy but not too greedy".
One (hopefully not too inefficient) way to do that would be to successively remove trailing digits until an anchored match is obtained i.e. the entire remaining sequence s* may be represented as n x t for some subsequence t. In perl, we can write that as a simple loop
perl -lne 'while (! s/^(\d+)\1+$/$1/) {chop $_}; print'
Testing with your 250-digit sequence:
a=( 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 2 1 2 0 0 2 0 2 2 2 1 1 0 )
Then
printf '%d' "${a[@]}" | perl -lne 'while (! s/^(\d+)\1+$/$1/) {chop $_}; print'
1102120020222
NOTE: this will fail to terminate if the string is exhausted before a match is found; if that's a possibility, you will need to test for that and break out of the while loop.
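A different way to get the same results, without regular expressions and without the termination problem, is to search directly for the shortest period. The Python sketch below is a swapped-in brute-force technique, not a translation of the perl one-liners: it returns the shortest prefix that, when repeated, reproduces the whole sequence (allowing a truncated final copy), or None if no prefix repeats at least twice. The function name shortest_period is hypothetical.

```python
def shortest_period(seq):
    """Return the shortest prefix p such that seq is p repeated, with an
    optional truncated final copy; None if no period fits at least twice."""
    n = len(seq)
    for p in range(1, n // 2 + 1):
        # seq has period p if every element equals the one p places earlier.
        if all(seq[i] == seq[i - p] for i in range(p, n)):
            return seq[:p]
    return None


print(shortest_period([2, 5, 6, 3, 4] * 4))
print(shortest_period([1, 2, 3, 1, 2, 3, 1, 2]))  # incomplete final repeat
print(shortest_period([1, 2, 3, 4]))              # no repetition
```

Because it compares list elements rather than concatenated digits, this sketch also works for multi-digit integers. Its worst-case cost is quadratic in the length of the sequence, which should be acceptable for arrays of a few hundred elements.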

I tested this only with the inputs you provided.
Assumption: the pattern to match always starts at the beginning of the array and repeats thereafter.
#!/bin/bash
#arr=(2 5 7 8 2 6 5 3 5 4 2 5 7 8 2 6 5 3 5 4 2 5 7 8 2 6 5 3 5 4)
arr=(2 5 6 3 4 2 5 6 3 4 2 5 6 3 4 2 5 6 3 4)
echo ${arr[@]}
n=${#arr[*]}
match=0
in_pattern=false
print_array()
{
local first=$1
local last=$2
local i
for ((i=first; i<=last; i++));do
printf "%d " ${arr[i]}
done
printf "\n"
}
i=0
start=0
end=0
j=$((i+1))
while (( j < n )); do
#echo "arr[$i] ${arr[i]} arr[$j] ${arr[j]}"
if [[ ${arr[i]} -ne ${arr[j]} ]];then
if [[ $match -ge 1 ]];then
echo "arr[$i] != arr[$j]"
echo "pattern doesnt repeat after match # $match"
exit 1
fi
((j++))
i=0
in_pattern=false
continue
fi
if $in_pattern ; then
if [[ $i -eq $end ]];then
((match++))
end_match=$j
echo "match # $match matched from $start -> $end and $start_match -> $end_match"
print_array $start $end
print_array $start_match $end_match
((j++))
i=0
in_pattern=false
continue
fi
else
if [[ $match -eq 0 ]];then
end=$((j-1))
fi
start_match=$j
in_pattern=true
#echo "trying to match from start $start end $end to start_match $start_match"
fi
((i++))
((j++))
done
Output with the first array:
./sequence.sh
2 5 7 8 2 6 5 3 5 4 2 5 7 8 2 6 5 3 5 4 2 5 7 8 2 6 5 3 5 4
match # 1 matched from 0 -> 9 and 10 -> 19
2 5 7 8 2 6 5 3 5 4
2 5 7 8 2 6 5 3 5 4
match # 2 matched from 0 -> 9 and 20 -> 29
2 5 7 8 2 6 5 3 5 4
Output with the second array:
./sequence.sh
2 5 6 3 4 2 5 6 3 4 2 5 6 3 4 2 5 6 3 4
match # 1 matched from 0 -> 4 and 5 -> 9
2 5 6 3 4
2 5 6 3 4
match # 2 matched from 0 -> 4 and 10 -> 14
2 5 6 3 4
2 5 6 3 4
match # 3 matched from 0 -> 4 and 15 -> 19
2 5 6 3 4
2 5 6 3 4

How to import data with markers - but excluding those markers?

When I import a matrix of data, the first column of the first row of each block contains a marker that flags every new acquisition, and this marker interferes with how MATLAB imports the data.
Is there a way to handle this in code?
For example:
'>1 6 1 1 -0.00161
1 6 1 2 -0.00140
1 6 1 3 -0.00145
1 6 1 4 -0.00153
1 6 1 5 -0.00120
1 6 1 6 -0.00076
I would prefer not to remove the > from the data manually, as there will potentially be thousands of them.

If you're on a *nix system or have Cygwin, you can get rid of these > characters by piping the output through sed. For instance:
user@host $ cat out.txt
>0 5 3 4
0 6 4 3
>1 5 3 6
1 2 4 5
user@host $ cat out.txt |sed 's/>//g'
If you need to store this new output in a file:
user@host $ cat out.txt
>0 5 3 4
0 6 4 3
>1 5 3 6
1 2 4 5
user@host $ cat out.txt |sed 's/>//g' > out_without_unneeded_symbols.txt
user@host $ cat out_without_unneeded_symbols.txt
0 5 3 4
0 6 4 3
1 5 3 6
1 2 4 5
If this output comes from some program in the current directory:
user@host $ ./some_program |sed 's/>//g'

Here is one possible implementation in MATLAB:
% read file lines as a cell array of strings
fid = fopen('file.dat', 'rt');
C = textscan(fid, '%s', 'Delimiter','');
C = C{1};
fclose(fid);
% find marker locations
markers = strncmp('>', C, 1);
% remove markers
C = regexprep(C, '^>', '');
% parse numbers into a numeric matrix
X = regexp(C, '\s+', 'split');
X = str2double(vertcat(X{:}));
The result:
% the full matrix
>> X
X =
0 5 3 4
0 6 4 3
1 5 3 6
1 2 4 5
% only the marked rows
>> X(markers,:)
ans =
0 5 3 4
1 5 3 6
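The same two-step idea (record which rows carry a marker, then strip the marker before parsing) can be sketched outside MATLAB as well. The Python version below is an illustration only; the function name parse_marked_rows is hypothetical, and the input is assumed to be whitespace-separated numbers with an optional leading > on each line.

```python
def parse_marked_rows(lines):
    """Parse numeric rows, stripping a leading '>' marker.

    Returns (rows, markers), where markers[i] is True if row i was marked.
    """
    rows, markers = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        marked = line.startswith(">")
        if marked:
            line = line[1:]
        rows.append([float(tok) for tok in line.split()])
        markers.append(marked)
    return rows, markers


sample = [">0 5 3 4", "0 6 4 3", ">1 5 3 6", "1 2 4 5"]
rows, markers = parse_marked_rows(sample)
marked_rows = [row for row, m in zip(rows, markers) if m]
print(rows)
print(marked_rows)
```

Keeping the marker positions as a separate boolean list preserves the information about where each new acquisition starts, which would be lost if the > characters were simply deleted.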

how to vectorize the following for loop?

Can anyone help me vectorize this loop?
I have a large matrix and I want to replace all the pixel values that occur fewer times than some threshold value. For simplicity, let's say:
a = randi([1 5],10,10);
for i = 1:length(a)
someMat=a(a==i);
if length(someMat)<20
a(a==i)=0;
end
end
but it's killing me.
Example:
a = randi([1 5],10,10)
a =
5 2 1 5 5 5 2 2 3 2
3 3 5 4 4 4 3 1 1 5
5 1 3 5 3 3 4 1 3 1
3 1 5 3 2 5 1 1 5 1
1 1 4 3 4 3 4 4 5 1
1 4 3 5 1 1 2 2 2 1
3 3 5 2 4 1 1 3 2 4
4 1 5 3 4 5 3 4 3 3
5 3 5 5 4 3 1 3 4 1
4 1 1 3 5 5 1 3 3 5
Result for threshold 20:
5 0 1 5 5 5 0 0 3 0
3 3 5 0 0 0 3 1 1 5
5 1 3 5 3 3 0 1 3 1
3 1 5 3 0 5 1 1 5 1
1 1 0 3 0 3 0 0 5 1
1 0 3 5 1 1 0 0 0 1
3 3 5 0 0 1 1 3 0 0
0 1 5 3 0 5 3 0 3 3
5 3 5 5 0 3 1 3 0 1
0 1 1 3 5 5 1 3 3 5
Pixel value 4 occurred 17 times.
Pixel value 2 occurred 10 times.
I tried something like:
[nVal Index] = histc(a(:),unique(a)); %
nVal(nVal>20) = 1; % just some threshold value and assigning by some Number may be zero as well
But I don't know how to replace the values of the corresponding pixels and apply reshape to get the matrix back in its original form. I'm not even sure that I will get the same matrix with reshape. Please help.
Thanks.

I think this does what you want:
threshold_length = 20;
replace_value = 0;
u = unique(a); %// values of a
h = histc(a(:), u); %// count for each value
r = u(h<threshold_length); %// values to be removed
a(ismember(a,r)) = replace_value; %// remove those values

I see @LuisMendo arrived at mostly the same solution quicker than I did, but an alternative to using ismember is to use more of what unique gives you:
threshold = 20;
[vals, ~, ix] = unique(a); % capture the values and their indices
counts = histc(a(:), vals); % count the occurrences of each value
vals(counts<threshold) = 0; % zero the values that aren't common enough
a(:) = vals(ix); % recreate the matrix with updated values
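Both answers follow the same pattern: count each distinct value once, decide which values are too rare, then rewrite the matrix in a single pass. A minimal pure-Python sketch of that pattern is shown below for comparison; it is not a translation of either MATLAB snippet, and the function name zero_rare_values is hypothetical.

```python
from collections import Counter


def zero_rare_values(matrix, threshold, replace_value=0):
    """Replace every value that occurs fewer than `threshold` times."""
    # Count occurrences of each value across the whole matrix.
    counts = Counter(v for row in matrix for v in row)
    # Values that are not common enough.
    rare = {v for v, c in counts.items() if c < threshold}
    # Rebuild the matrix in its original shape with rare values replaced.
    return [[replace_value if v in rare else v for v in row] for row in matrix]


a = [[1, 1, 2],
     [1, 3, 3]]
print(zero_rare_values(a, threshold=2))
```

Because the rewrite keeps the row structure, no reshape step is needed afterwards, which addresses the concern raised in the question about recovering the original form.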
