Let's say I have a tab-delimited file, lookup.txt:
070-031 070-291 030-031
1 2 X
2 3 1
3 4 2
4 5 3
5 6 4
6 7 5
7 8 6
8 9 7
And I have the following files with values to look up:
$cat 030-031.txt
Line1 070-291 4
Line2 070-031 3
$cat 070-031.txt
Line1 030-031 5
Line2 070-291 8
I would like script.awk to return
$script.awk 030-031.txt lookup.txt
Line1 070-291 4 2
Line2 070-031 3 2
and
$script.awk 070-031.txt lookup.txt
Line1 030-031 5 6
Line2 070-291 8 7
The only thing I can think to do is to create a separate expanded lookup file for each, e.g.
$cat lookup_030-031.txt
070-031:1 X
070-031:2 1
070-031:3 2
070-031:4 3
070-031:5 4
070-031:6 5
070-031:7 6
070-031:8 7
070-291:2 X
070-291:3 1
070-291:4 2
070-291:5 3
070-291:6 4
070-291:7 5
070-291:8 6
070-291:9 7
and then
awk 'NR==FNR { a[$1]=$2;next}{print $0,a[$2":"$3]}' lookup_030-031.txt 030-031.txt
This works, but I have many more columns and approximately 10,000 rows, so I'd rather not have to generate a lookup file for each. Many thanks.
AMENDED
Glenn Jackman's answer is a perfect solution to the initial question, and his second answer is more efficient. However, I forgot to stipulate that the script should handle duplicates. For instance, it should be able to handle
$cat 030-031
070-031 3
070-031 6
and return BOTH corresponding numbers for the respective file (2 and 5, respectively). Only Glenn's first answer handles repeated lookups; his second returns only the last value found.
OK, I see now. You have to read the lookup file into a big data structure, and then referencing it from the individual files is easy.
$ cat script.awk
BEGIN { OFS = "\t" }
NR == 1 {
    for (i=1; i<=NF; i++)
        label[i] = $i
    next
}
NR == FNR {
    for (i=1; i<=NF; i++)
        for (j=1; j<=NF; j++)
            if (i != j)
                value[label[i], $i, label[j]] = $j
    next
}
FNR == 1 {
    split(FILENAME, a, /\./)
    j = a[1]
}
{
    $(NF+1) = value[$1, $2, j]
    print
}
$ awk -f script.awk lookup.txt 030-031.txt
070-291 4 2
070-031 3 2
$ awk -f script.awk lookup.txt 070-031.txt
030-031 5 6
070-291 8 7
This version is a bit more compact, and passes the filenames in your preferred order:
$ cat script.awk
BEGIN { OFS = "\t" }
NR == 1 {
    split(FILENAME, a, /\./)
    dest = a[1]
}
NR == FNR {
    src[$1] = $2
    next
}
FNR == 1 {
    for (i=1; i<=NF; i++)
        col[$i] = i
    next
}
{
    for (from in src)
        if ($col[from] == src[from])
            print from, src[from], $col[dest]
}
$ awk -f script.awk 030-031.txt lookup.txt
070-031 3 2
070-291 4 2
$ awk -f script.awk 070-031.txt lookup.txt
030-031 5 6
070-291 8 7
This works, but I have many more columns and approximately 10,000 rows, so I'd rather not have to generate a lookup file for each.
Your dataset is small enough that you can keep the lookups in memory.
In a BEGIN section, read "lookup.txt" into a two-dimension (nested) array so that:
lookup['070-031'][4] = 3
lookup['070-291'][5] = 3
Then run through all the data files at once:
script.awk 070-031.txt 070-291.txt
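A minimal sketch of that in-memory approach (assumptions: the lookup table is always named "lookup.txt", each data file is named "<dest-column>.txt" and holds lines of the form "<src-column> <value>" as in the amended example; this uses awk's classic comma-subscript multidimensional arrays rather than gawk's true nested arrays):

```shell
awk '
BEGIN {
  OFS = "\t"
  # slurp lookup.txt once:
  # lookup[src_label, src_value, dest_label] = dest_value
  if ((getline header < "lookup.txt") > 0) {
    n = split(header, label)
    while ((getline line < "lookup.txt") > 0) {
      split(line, f)
      for (i = 1; i <= n; i++)
        for (j = 1; j <= n; j++)
          if (i != j) lookup[label[i], f[i], label[j]] = f[j]
    }
    close("lookup.txt")
  }
}
FNR == 1 { split(FILENAME, a, /\./); dest = a[1] }  # dest column from file name
{ print $0, lookup[$1, $2, dest] }
' 030-031.txt 070-031.txt
```

Because every data line is looked up independently, repeated keys such as the two 070-031 lines in the amended example each get their own answer.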
We would like to interpolate data within an array using awk. I have an array
1 1 3 3 ...
3 3 5 3
5 5 6 5
6 6 6 5
for which we would like to obtain
1 1 3 3 ...
2 2 4 4
3 3 5 3
4 4 5.5 4
5 5 6 5
6 6 6 5
Doing so would give us a complete array covering all possible values of the first column, which represents a timeline. Columns 2 onward are data. The matrix is of size 4x110100. We have this script:
awk '
{
    P[$1] = $2
    I[i++] = $1
}
END {
    j = 0; s = I[j]; t = I[j+1]
    for (i=m; i<=n; i++) {
        if (I[j+2] && i > t) {
            j++; s = I[j]; t = I[j+1]
        }
        print i, P[s] + (i-s)*(P[t]-P[s])/(t-s)
    }
}' m=1 n=6 f1.dat > f2.dat
but it only handles the first two columns:
1 1
2 2
3 3
4 4
5 5
6 6
How could we extend the interpolation to the entire array? I have tried with 'for' and 'while' loops, but we cannot achieve the aim.
You can do this by keeping track of just the current and previous lines:
BEGIN {
    # initialise "previous" line
    getline;
    for (i=0; i<=NF; i++) p[i] = $i;
}
{
    # print previous line
    print p[0];
    # check if column 1 has skipped
    if ( (d = $1-p[1]) > 1 ) {
        # if so, insert (d-1) new rows
        for (i=1; i<d; i++) {
            # interpolate values for each column
            for (c=1; c<=NF; c++) {
                printf "%s%s",
                    p[c] + (i/d)*($c-p[c]),  # linear interpolation
                    c==NF ? ORS : OFS;       # avoid trailing spaces
            }
        }
    }
    # update previous line
    for (i=0; i<=NF; i++) p[i] = $i;
}
END {
    # print the final line
    print p[0];
}
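For concreteness, here is the script above condensed into a single invocation and run on the 4-column sample (the f1.dat/f2.dat names are just the ones from the question; note that in this truncated sample both neighbours of the first gap are 3 in column 4, so the first inserted row ends in 3):

```shell
# Build the sample input, then interpolate the missing timeline rows.
printf '1 1 3 3\n3 3 5 3\n5 5 6 5\n6 6 6 5\n' > f1.dat
awk '
BEGIN { getline; for (i=0; i<=NF; i++) p[i] = $i }
{
  print p[0]
  if ((d = $1 - p[1]) > 1)
    for (i=1; i<d; i++)
      for (c=1; c<=NF; c++)
        printf "%s%s", p[c] + (i/d) * ($c - p[c]), (c==NF ? ORS : OFS)
  for (i=0; i<=NF; i++) p[i] = $i
}
END { print p[0] }
' f1.dat > f2.dat
cat f2.dat
# 1 1 3 3
# 2 2 4 3
# 3 3 5 3
# 4 4 5.5 4
# 5 5 6 5
# 6 6 6 5
```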
{
    k = 0
    x = 0
    fracon = (10/2) + 1
    {
        for (j = 1; j <= 1100; j++)
        {
            if (j <= fracon)
                scal[j] = j - x
            else
                k = k + 1
            scal[j] = j - (2*k)
            {
                if (scal[j] == 1)
                    fracon = fracon + 11
                {
                    if (j % 11 == 0)
                        x = x + 11
                    k = k + 0.5
                }
            }
        }
    }
}
That's all. I used the above code to try to generate the following array. It works in MATLAB, but it does not work in awk.
array= [1 2 3 4 5 6 5 4 3 2 1 1 2 3 4 5 6]
here is another way of generating the same sequence
$ awk 'BEGIN{for(i=0;i<=20;i++) {k=i%11+1; printf "%s ", (k<7?k:12-k)}; print ""}'
1 2 3 4 5 6 5 4 3 2 1 1 2 3 4 5 6 5 4 3 2
Not sure whether you want just a repeating 11-element cycle or not; it's difficult to say from the limited sample.
or without awk
$ yes $({ seq 6; seq 5 -1 1; } | paste -sd' ') | head -100 | paste -sd' '
1 2 3 4 5 6 5 4 3 2 1 1 2 3 4 5 6 5 4 3 2 1 ...
with square brackets
$ awk 'BEGIN{printf "[";
for(i=0;i<=1100;i++) {k=i%11+1; printf "%s ", (k<7?k:12-k)};
printf "]\n"}'
[1 2 3 4 5 6 5 4 3 2 1 1 2 3 4 5 6 ... 5 4 3 2 1 ]
Stuffing these values into a large array is not optimal; you can easily write a function to return the indexed value:
$ awk 'function k(i,_i) {_i=i%11+1; return _i<7?_i:12-_i}
BEGIN{for(i=0;i<=25;i++) print k(i)}'
In the real code, you'll use k(i) instead of printing it. Note that the index starts from 0.
N.B. _i is a local variable of the awk function; you don't need to supply it in the call.
I have the following file
foo_foo bar_blop baz_N toto_N lorem_blop
1 1 0 0 1
1 1 0 0 1
And I'd like to remove the columns whose header carries the _N tag (or, equivalently, select all the others).
So the output should be
foo_foo bar_blop lorem_blop
1 1 1
1 1 1
I found some answers, but none did exactly this.
I know awk can do this, but I don't understand how to do it by myself (I'm not good at awk).
Thanks for the help :)
awk 'NR==1 { for (i=1; i<=NF; i++) if (!($i ~ /_N$/)) { a[i]=1; m=i } }
     { for (i=1; i<=NF; i++) if (a[i]) printf "%s%s", $i, (i==m ? RS : FS) }' f | column -t
outputs:
foo_foo bar_blop lorem_blop
1 1 1
1 1 1
$ cat tst.awk
NR==1 {
for (i=1;i<=NF;i++) {
if ( (tgt == "") || ($i !~ tgt) ) {
f[++nf] = i
}
}
}
{
for (i=1; i<=nf; i++) {
printf "%s%s", $(f[i]), (i<nf?OFS:ORS)
}
}
$ awk -v tgt="_N" -f tst.awk file | column -t
foo_foo bar_blop lorem_blop
1 1 1
1 1 1
$ awk -f tst.awk file | column -t
foo_foo bar_blop baz_N toto_N lorem_blop
1 1 0 0 1
1 1 0 0 1
$ awk -v tgt="blop" -f tst.awk file | column -t
foo_foo baz_N toto_N
1 0 0
1 0 0
The main difference between this and #Kent's solution is performance; the impact will vary based on the percentage of fields you want to print on each line.
The above, when reading the first line of the file, creates an array of the field numbers to print, and then for every line of the input file it just prints those fields in a loop. So if you wanted to print 3 out of 100 fields, this script would loop through just 3 iterations/fields per input line.
#Kent's solution also creates an array of the field numbers to print, but then for every line of the input file it visits every field to test whether it's in that array before printing it. So if you wanted to print 3 out of 100 fields, #Kent's script would loop through all 100 iterations/fields per input line.
I'm trying to convert a continuous stream of (random) data into comma-separated and line-separated values. I'm converting the continuous data into CSV, and then after some number of columns (let's say 80), I need to put a newline and repeat the process.
Here's what I did for csv:
gawk '$1=$1' FIELDWIDTHS='4 5 7 1 9 5 10 6 8 3 2 2 8 4 8 8 4 6 9 1' OFS=, tmp
'tmp' is the file with following data:
"ZaOAkHEnOsBmD5yZk8cNLC26rIFGSLpzuGHtZgb4VUP4x1Pd21bukeK6wUYNueQQMglvExbnjEaHuoxU0b7Dcne5Y4JP332RzgiI3ZDgHOzm0gjDLVat8au7uckM3t60nqFX0Cy93jXZ5T0IaQ4fw2JfdNF1PbqxDxXv7UGiyysFJ8z16TmYQ9zfBRCZvZirIyRboHNEGgMUFZ18y8XXCGrbpeL0WLstzpSuXetmo47G2xPkDLDcFA6cdM4WAFNpoC2ztspY7YyVsoMZdU7D3u3Lm6dDcKuJKdTV6600GkbLuvAamKGyzMtoqW3liI3ybdTNR9KLz2l7KTjUiGgc3Eci5wnhIosAUMkcSQVxFrZdJ9MVyj6duXAk0CJoRvHYuyfdAr7vjlwjkLkYPtFvAZp6wK3dfetoh3ZmhJhUxqzuxOLDQ9FYcvz64iuIUbgXVZoRnpRoNGw7j3fCwyaqCi..."
I'm generating the continuous sequence from /dev/urandom. What I can't figure out is how to make gawk add a newline and start over after some number of columns.
I got it actually. A simple for loop did that.
Here's my whole code:
for i in $(seq 10)
do
tr -dc A-Za-z0-9 < /dev/urandom | head -c 100 > tmp
gawk '$1=$1' FIELDWIDTHS='4 5 7 1 9 5 10 6 8 3 2 2 8 4 8 8 4 6 9 1' OFS=, tmp >> tmp1
done
Any optimizations would be appreciated.
When I import a matrix of data, the first column of the first row contains a marker each time data acquisition restarts, and this marker interferes with how MATLAB imports the data.
Is there a way to handle this in code?
for example:
>1 6 1 1 -0.00161
1 6 1 2 -0.00140
1 6 1 3 -0.00145
1 6 1 4 -0.00153
1 6 1 5 -0.00120
1 6 1 6 -0.00076
I would prefer not to remove the > characters from the data manually, as there will potentially be thousands of them.
If you're on a *nix system or have Cygwin, you can get rid of these > characters by piping the output through sed. For instance:
user#host $ cat out.txt
>0 5 3 4
0 6 4 3
>1 5 3 6
1 2 4 5
user#host $ cat out.txt |sed 's/>//g'
If you need to store this new output to a file:
user#host $ cat out.txt
>0 5 3 4
0 6 4 3
>1 5 3 6
1 2 4 5
user#host $ cat out.txt |sed 's/>//g' > out_without_unneeded_symbols.txt
user#host $ cat out_without_unneeded_symbols.txt
0 5 3 4
0 6 4 3
1 5 3 6
1 2 4 5
If this output comes from some program in the current dir:
user#host $ ./some_program |sed 's/>//g'
Here is one possible implementation in MATLAB:
% read file lines as a cell array of strings
fid = fopen('file.dat', 'rt');
C = textscan(fid, '%s', 'Delimiter','');
C = C{1};
fclose(fid);
% find marker locations
markers = strncmp('>', C, 1);
% remove markers
C = regexprep(C, '^>', '');
% parse numbers into a numeric matrix
X = regexp(C, '\s+', 'split');
X = str2double(vertcat(X{:}));
The result:
% the full matrix
>> X
X =
0 5 3 4
0 6 4 3
1 5 3 6
1 2 4 5
% only the marked rows
>> X(markers,:)
ans =
0 5 3 4
1 5 3 6