I have trouble understanding an awk command that I want to change slightly (but can't, because I don't understand the code well enough).
This awk command merges text files that each have 6 columns. In the output file, the first column is the union of the first columns of all input files. The other columns of the output file are the remaining columns of the input files, with blanks added where needed so that they still line up with the first-column values.
First, I would like to parse only some specific columns from these files, not all 6. I couldn't figure out where to specify that in the awk loop.
Secondly, the column headers are no longer the first row of the output file. It would be nice to have them as a header in the output file as well.
Thirdly, I need to know which file the data comes from. I know that the command takes the files in the order they appear in ls -lh *mosdepth.summary.txt, so I can deduce that the first 6 columns are from file 1, the next 6 from file 2, etc. However, I would like to have this information in the output file automatically, to reduce the potential for human error when inferring the origin of the data.
Here is the awk command:
awk -F"\t" -v OFS="\t" 'F!=FILENAME { FNUM++; F=FILENAME }
{ COL[$1]++; C=$1; $1=""; A[C, FNUM]=$0 }
END {
for(X in COL)
{
printf("%s", X);
for(N=1; N<=FNUM; N++) printf("%s", A[X, N]);
printf("\n");
}
}' *mosdepth.summary.txt > Se_combined.coverage.txt
The input files look like this:
cat file1
chrom length bases mean min max
contig_1_pilon 223468 603256 2.70 0 59
contig_2_pilon 197061 1423255 7.22 0 102
contig_6_pilon 162902 1372153 8.42 0 80
contig_19_pilon 286502 1781926 6.22 0 243
contig_29_pilon 263348 1251842 4.75 0 305
contig_32_pilon 291449 1819758 6.24 0 85
contig_34_pilon 51310 197150 3.84 0 29
contig_37_pilon 548146 4424483 8.07 0 399
contig_41_pilon 7529 163710 21.74 0 59
cat file2
chrom length bases mean min max
contig_2_pilon 197061 2098426 10.65 0 198
contig_19_pilon 286502 1892283 6.60 0 233
contig_32_pilon 291449 2051790 7.04 0 172
contig_37_pilon 548146 6684861 12.20 0 436
contig_42_pilon 14017 306188 21.84 0 162
contig_79_pilon 17365 883750 50.89 0 1708
contig_106_pilon 513441 6917630 13.47 0 447
contig_124_pilon 187518 374354 2.00 0 371
contig_149_pilon 1004879 13603882 13.54 0 801
The wrong output looks like this:
contig_149_pilon 1004879 13603882 13.54 0 801
contig_79_pilon 17365 883750 50.89 0 1708
contig_1_pilon 223468 603256 2.70 0 59
contig_106_pilon 513441 6917630 13.47 0 447
contig_2_pilon 197061 1423255 7.22 0 102 197061 2098426 10.65 0 198
chrom length bases mean min max length bases mean min max
contig_37_pilon 548146 4424483 8.07 0 399 548146 6684861 12.20 0 436
contig_41_pilon 7529 163710 21.74 0 59
contig_6_pilon 162902 1372153 8.42 0 80
contig_42_pilon 14017 306188 21.84 0 162
contig_29_pilon 263348 1251842 4.75 0 305
contig_19_pilon 286502 1781926 6.22 0 243 286502 1892283 6.60 0 233
contig_124_pilon 187518 374354 2.00 0 371
contig_34_pilon 51310 197150 3.84 0 29
contig_32_pilon 291449 1819758 6.24 0 85 291449 2051790 7.04 0 172
EDIT:
Thanks to input from several users, I managed to address points 1 and 3 like this:
awk -F"\t" -v OFS="\t" 'F!=FILENAME { FNUM++; F=FILENAME }
{ B[FNUM]=F; COL[$1]; C=$1; $1=""; A[C, FNUM]=$4}
END {
printf("%s\t", "contig")
for (N=1; N<=FNUM; N++)
{ printf("%.5s\t", B[N])}
printf("\n")
for(X in COL)
{
printf("%s\t", X);
for(N=1; N<=FNUM; N++)
{ printf("%s\t", A[X, N]);
}
printf("\n");
}
}' file1.txt file2.txt > output.txt
with output
contig file1 file2
contig_149_pilon 13.54
contig_79_pilon 50.89
contig_1_pilon 2.70
contig_106_pilon 13.47
contig_2_pilon 7.22 10.65
chrom mean mean
contig_37_pilon 8.07 12.20
contig_41_pilon 21.74
contig_6_pilon 8.42
contig_42_pilon 21.84
contig_29_pilon 4.75
contig_19_pilon 6.22 6.60
contig_124_pilon 2.00
contig_34_pilon 3.84
contig_32_pilon 6.24 7.04
Awk processes files in records, where records are separated by the record separator RS. Each record is split into fields, where the field separator is defined by the variable FS, which the -F flag can set.
In the case of the program presented in the OP, the record separator is the default value, which is the <newline> character, and the field separator is set to the <tab> character.
Awk programs are generally written as a sequence of pattern-action pairs of the form pattern { action }. These pairs are evaluated in order, and each one performs action on records for which pattern evaluates to a non-zero number or a non-empty string.
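For instance, the pattern can be any expression, and the action runs only for records where it holds. A minimal sketch with made-up inline data:

```shell
# Pattern-action demo: print the first field of records whose second field exceeds 5.
printf '1 3\n2 8\n3 9\n' | awk '$2 > 5 { print $1 }'
# -> prints 2 and 3
```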
In the current program there are three such pattern-action pairs:
F!=FILENAME { FNUM++; F=FILENAME }: This states that if the value of F differs from the FILENAME currently being processed, then increase the value of FNUM by one and update F to the current FILENAME.
In the end, this is the same as just checking if we are processing a new file or not. The equivalent version of this would be:
(FNR==1) { FNUM++ }
which reads: If we are processing the first record of the current file (FNR), then increase the file count FNUM.
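A quick way to convince yourself the two forms agree, using two throwaway files (the names f1 and f2 are arbitrary):

```shell
printf 'a\nb\n' > f1
printf 'c\n' > f2
# Both variants count the number of input files; each should print 2.
awk 'F!=FILENAME { FNUM++; F=FILENAME } END { print FNUM }' f1 f2
awk 'FNR==1 { FNUM++ } END { print FNUM }' f1 f2
```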
{ COL[$1]++; C=$1; $1=""; A[C, FNUM]=$0 }: As there is no pattern, it defaults to true. So for each record/line, increment the number of times the value in the first column has been seen, stored in the associative array COL (key-value pairs). Memorize the first field in C, and store in array A the value of the current record with the first field removed. So if a record of the second file reads "foo A B C D" and foo has already been seen 3 times, then COL["foo"] will equal 4 and A["foo",2] will read " A B C D".
END{ ... }: This is a special pattern-action pair. END indicates that the action should only be executed at the end, when all files have been processed. What the END block does is straightforward: it prints all records of each file, including empty records.
In the end, this entire script can be simplified to the following:
awk 'BEGIN{ FS="\t" }
{ file_list[FILENAME]
key_list[$1]
record_list[FILENAME,$1]=$0 }
END { for (key in key_list)
for (fname in file_list)
print ( record_list[fname,key] ? record_list[fname,key] : key )
}' file1 file2 file3 ...
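As a quick sanity check, here is the simplified script run on two tiny throwaway files (fileA and fileB are made-up names). Keys missing from a file print as the bare key, and the for-in traversal order is unspecified:

```shell
printf 'k1\ta\tb\nk2\tc\td\n' > fileA
printf 'k2\te\tf\n' > fileB
# One output line per (key, file) pair; a missing record falls back to the key.
awk 'BEGIN{ FS="\t" }
     { file_list[FILENAME]
       key_list[$1]
       record_list[FILENAME,$1]=$0 }
     END { for (key in key_list)
             for (fname in file_list)
               print ( record_list[fname,key] ? record_list[fname,key] : key )
     }' fileA fileB
```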
Assuming your '*mosdepth.summary.txt' files look like the following:
$ ls *mos*txt
1mosdepth.summary.txt 2mosdepth.summary.txt 3mosdepth.summary.txt
And contents are:
$ cat 1mosdepth.summary.txt
chrom length bases mean min max
contig_1_pilon 223468 1181176 5.29 0 860
contig_2_pilon 197061 2556215 12.97 0 217
contig_6_pilon 162902 2132156 13.09 0 80
$ cat 2mosdepth.summary.txt
chrom length bases mean min max
contig_19_pilon 286502 2067244 7.22 0 345
contig_29_pilon 263348 2222566 8.44 0 765
contig_32_pilon 291449 2671881 9.17 0 128
contig_34_pilon 51310 525393 10.24 0 47
$ cat 3mosdepth.summary.txt
chrom length bases mean min max
contig_37_pilon 548146 6652322 12.14 0 558
contig_41_pilon 7529 144989 19.26 0 71
The following awk command might be appropriate:
$ awk -v target_cols="1 2 3 4 5 6" 'BEGIN{split(target_cols, cols," ")} \
NR==1{printf "%s ", "file#"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""} \
FNR==1{fnbr++} \
FNR>=2{printf "%s ", fnbr; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""}' *mos*txt | column -t
Output:
file# chrom length bases mean min max
1 contig_1_pilon 223468 1181176 5.29 0 860
1 contig_2_pilon 197061 2556215 12.97 0 217
1 contig_6_pilon 162902 2132156 13.09 0 80
2 contig_19_pilon 286502 2067244 7.22 0 345
2 contig_29_pilon 263348 2222566 8.44 0 765
2 contig_32_pilon 291449 2671881 9.17 0 128
2 contig_34_pilon 51310 525393 10.24 0 47
3 contig_37_pilon 548146 6652322 12.14 0 558
3 contig_41_pilon 7529 144989 19.26 0 71
Alternatively, the following will output the filename rather than file#:
$ awk -v target_cols="1 2 3 4 5 6" 'BEGIN{split(target_cols, cols," ")} \
NR==1{printf "%s ", "fname"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""} \
FNR==1{fnbr=FILENAME} \
FNR>=2{printf "%s ", fnbr; fnbr="-"; for (i=1;i<=length(cols);i++) {printf "%s ", $cols[i]} print ""}' *mos*txt | column -t
Output:
fname chrom length bases mean min max
1mosdepth.summary.txt contig_1_pilon 223468 1181176 5.29 0 860
- contig_2_pilon 197061 2556215 12.97 0 217
- contig_6_pilon 162902 2132156 13.09 0 80
2mosdepth.summary.txt contig_19_pilon 286502 2067244 7.22 0 345
- contig_29_pilon 263348 2222566 8.44 0 765
- contig_32_pilon 291449 2671881 9.17 0 128
- contig_34_pilon 51310 525393 10.24 0 47
3mosdepth.summary.txt contig_37_pilon 548146 6652322 12.14 0 558
- contig_41_pilon 7529 144989 19.26 0 71
With either command, target_cols="1 2 3 4 5 6" specifies the columns to extract.
target_cols="1 2 3", for example, will produce:
fname chrom length bases
1mosdepth.summary.txt contig_1_pilon 223468 1181176
- contig_2_pilon 197061 2556215
- contig_6_pilon 162902 2132156
2mosdepth.summary.txt contig_19_pilon 286502 2067244
- contig_29_pilon 263348 2222566
- contig_32_pilon 291449 2671881
- contig_34_pilon 51310 525393
3mosdepth.summary.txt contig_37_pilon 548146 6652322
- contig_41_pilon 7529 144989
target_cols="4 5 6" will produce:
fname mean min max
1mosdepth.summary.txt 5.29 0 860
- 12.97 0 217
- 13.09 0 80
2mosdepth.summary.txt 7.22 0 345
- 8.44 0 765
- 9.17 0 128
- 10.24 0 47
3mosdepth.summary.txt 12.14 0 558
- 19.26 0 71
I have a text file stored in a variable, say $RC. It looks like this:
Total Copied Skipped Mismatch FAILED Extras
Dirs : 49 10 0 0 0 0
Files : 212 170 37 0 5 2
Bytes : 6.517 t 6.517 t 24.5 k 0 136.37 m 550
When I run
$RC | Measure-Object -Word -character -Line
it gives me the output as
Lines Words Characters Property
----- ----- ---------- --------
4 34 280
If I run $RC[1], it gives me this line:
Dirs : 39 9 0 0 0 0
Now I want to navigate to the 7th word in the above line (which is 0); how do I do that?
If that's not possible, my ultimate goal is to find the values under the FAILED column (which are 0, 5, 136.37) in the above text file and store them in another variable for comparison. How can this be done? Thank you in advance.
In case you don't want to reinvent the wheel, you can use the ConvertFrom-SourceTable cmdlet:
$RC = '
Total Copied Skipped Mismatch FAILED Extras
Dirs : 49 10 0 0 0 0
Files : 212 170 37 0 5 2
Bytes : 6.517 t 6.517 t 24.5 k 0 136.37 m 550'
$Data = ConvertFrom-SourceTable -Literal $RC
$Data |Format-Table
Total Copied Skipped Mismatch FAILED Extras
----- ------ ------- -------- ------ ------
Dirs : 49 10 0 0 0 0
Files : 212 170 37 0 5 2
Bytes : 6.517 t 6.517 t 24.5 k 0 136.37 m 550
Taking the first (zero-based, [0]) row and the Extras column:
$Data[0].Extras
0
And taking the whole FAILED column using Member-Access Enumeration:
$Data.Failed
0
5
136.37 m
I am trying to parse a molecular dynamics dump file which has headers printed periodically. Between two successive headers I have data in a column format (it is not guaranteed that the length of the data is the same between any two successive headers) which I want to store and post-process. Is there a way I can do this without excessive use of for loops?
The basic gist of it is:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ENTRIES
1079
ITEM: BOX BOUNDS xy xz yz ff ff pp
-1e+06 1e+06 0
-1e+06 1e+06 0
-1e+06 1e+06 0
ITEM: ENTRIES index c_1[1] c_1[2] c_2[1] c_2[2] c_2[3] c_2[4] c_2[5]
1 1 94 0.0399999 0 0.171554 -0.00124379 0
2 1 106 0.0399999 0 -0.0638316 0.116503 0
3 1 204 0.0299999 0 -0.124742 0.0290103 0
4 1 675 0.0299999 0 0.0245382 -0.116731 0
5 2 621 0.03 0 0.0328324 0.00185942 0
6 2 656 0.04 0 -0.0315086 0.016237 0
7 2 671 0.04 0 -0.00291159 -0.0169882 0
8 3 76 0.03 0 0.01775 0.0100646 0
9 3 655 0.03 0 0.00434063 -0.00750336 0
.
.
.
.
.
1076 678 692 100000 0 -0.222481 -1.44632e-06 0
1077 679 692 100000 0 -0.00232206 -8.05951e-09 0
1078 682 691 100000 0 0.0753935 -2.89438e-07 0
1079 687 692 100000 0 -0.0153246 -2.51076e-08 0
ITEM: TIMESTEP
1000
ITEM: NUMBER OF ENTRIES
1078
ITEM: BOX BOUNDS xy xz yz ff ff pp
-1e+06 1e+06 0
-1e+06 1e+06 0
-1e+06 1e+06 0
ITEM: ENTRIES index c_1[1] c_1[2] c_2[1] c_2[2] c_2[3] c_2[4] c_2[5]
1 1 94 0.0399997 0 1.3535 -0.00981109 0
2 1 106 0.0399986 0 -6.36969 11.6275 0
3 1 204 0.0299893 0 -236.114 54.9339 0
4 1 675 0.0299998 0 0.148064 -0.704365 0
.
.
.
.
TIA!
You don't need to write a single for loop to parse this file; MATLAB writes them for you:
[headers, tables] = parseTables('tables.txt')
...
function [headers, tables] = parseTables(filename)
content = fileread(filename); % read whole file
lines = splitlines(content); % split lines
values = cellfun(@str2num, lines, 'UniformOutput', false); % convert lines to float, when possible
headerLines = cellfun(@isempty, values); % lines with no floats
headers = lines(headerLines); % extract headers
startLines = find(headerLines)+1; % indices of first lines of tables
endLines = [startLines(2:end)-1; length(values)]; % indices of last lines of tables
tables = arrayfun(@(i, j) cell2mat(values(i:j)), ...
startLines, endLines, 'UniformOutput', false); % merge table rows to single matrix
end
The results will be stored in cell arrays:
headers =
8×1 cell array
{'ITEM: TIMESTEP' }
{'ITEM: NUMBER OF ENTRIES' }
{'ITEM: BOX BOUNDS xy xz yz ff ff pp' }
{'ITEM: ENTRIES index c_1[1] c_1[2] c_2[1] c_2[2] c_2[3] c_2[4] c_2[5] '}
{'ITEM: TIMESTEP' }
{'ITEM: NUMBER OF ENTRIES' }
{'ITEM: BOX BOUNDS xy xz yz ff ff pp' }
{'ITEM: ENTRIES index c_1[1] c_1[2] c_2[1] c_2[2] c_2[3] c_2[4] c_2[5] '}
tables =
8×1 cell array
{[ 0]}
{[ 1079]}
{ 3×3 double}
{13×8 double}
{[ 1000]}
{[ 1078]}
{ 3×3 double}
{ 4×8 double}
I have a tab-separated matrix (say, filename).
If I do:
head -1 filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
followed by:
head -2 filename | tail -1 | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
I get an answer of 24 (the same answer) for basically all rows.
But if I do it:
cat filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array)}'
I get:
24
25
25
25
25 ...
Why is that?
Here is the input file:
Case1 17.49 0.643 0.366 11.892 0.85 5.125 0.589 0.192 0.222 0.231 27.434 0.228 0 0.111 0.568 0.736 0.125 0.038 0.218 0.253 0.055 0.019 0 0.078
Case2 0.944 2.412 4.296 0.329 0.399 1.625 0.196 0.038 0.381 0.208 0.045 1.253 0.382 0.111 0.324 0.268 0.458 0.352 0 1.423 0.887 0.444 5.882 0.543
Case3 21.266 14.952 24.406 10.977 8.511 21.75 6.68 0.613 12.433 1.48 1.441 21.648 6.972 42.931 8.029 4.883 11.912 6.248 4.949 26.882 9.756 5.366 38.655 12.723
Case4 0.888 0 0.594 0.549 0.105 0.125 0 0 0.571 0.116 0.019 1.177 0.573 0.111 0.081 0.401 0 0.05 0.073 0 0 0 0 0.543
Well, I found the answer to my own problem:
I wonder how I missed it, but clearing the array at the end of each iteration is critical when reusing the same array name (no matter which language/script one uses). Here, asort(array) renumbers the array indices to 1..24, while the next record writes indices 2..25, so a stale element is left behind at index 1 and the length becomes 25.
The correct awk was:
cat filename | awk -F "\t" '{i=0;med=0;for(i=2;i<=NF;i++) array[i]=$i;asort(array);print length(array);delete array}'
I need to make a handwritten image to be tested with a neural network in Matlab. When I see the data contained in the training images from the MNIST I see that it is an array of different gray scales like:
Columns 343 through 351
0 -0.0240 0.4002 0.6555 0.0235 -0.0062 0 0 0
Columns 352 through 360
0 0 0 0 0 0 0 0 0
Columns 361 through 369
0 0 0 -0.0079 0.1266 0.3272 -0.0233 0.0005
corresponding to a 20x20 image, unrolled into a 1*400 dimensional array.
I have downloaded an image in jpeg format and did the following:
im=imread('image.jpg');
gi=rgb2gray(im);
gi=gi(:);
gi=gi';
That generates an array gi shown as <1*400 uint8>; the uint8 part does not appear in the MNIST samples when I inspect them in Matlab. When I check my array, the following values appear:
Columns 289 through 306
58 105 128 133 142 131 76 21 1 0 3 0 2 4 17 12 7 0
Columns 307 through 324
1 15 42 75 97 105 98 73 31 4 1 0 0 0 0 2 4 3
Columns 325 through 342
0 0 1 4 21 37 55 59 46 26 9 0 0 0 0 0 0 0
Columns 343 through 360
1 1 0 0 0 1 7 14 21 21 14 5 0 0 0 0 0 0
Columns 361 through 378
0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 2 0 0
When I visualize them everything is fine, but when I run my program the following message appears:
??? Error using ==> mtimes
MTIMES is not fully supported for integer classes. At least one input must be scalar.
Error in ==> predict at 15
h1 = sigmoid([ones(m, 1) X] * Theta1');
Error in ==> ex4 at 241
pred = predict(Theta1, Theta2, gi);
a situation that does not occur when I test my program, even with one random sample of the MNIST data; any help?
You could try something like this:
imfile = 'image.jpg';
im = double(rgb2gray(imread(imfile))); % double and convert to grayscale
im = imresize(im,[20,20]); % change to 20 by 20 dimension
im = im(:); % unroll matrix to vector
im = im./max(im); % normalise pixel values to [0,1]
Note the MNIST dataset is intended to require minimal preprocessing, and its images were originally black and white (bilevel), whereas you are using a colour image. They also apply normalisation and other preprocessing to produce a nice 28-by-28 image dataset; my brief snippet of code above is unlikely to be anywhere near as good as the MNIST preprocessing and is just intended to fix your error.
Your specific error is most likely because you don't convert with double().
You may also get further errors because your code needs the right dimensions, which can be achieved using imresize.
More information on MNIST dataset here:
http://yann.lecun.com/exdb/mnist/
temp.bgf
ATOM 218 CB ASN 1 34 -7.84400 -9.19900 -5.03100 C_3 4 0 -0.18000 0 0
ATOM 221 CG ASN 1 34 -7.37700 -7.83400 -4.55200 C_R 3 0 0.55000 0 0
ATOM 226 C ASN 1 34 -9.18200 -10.62100 -6.58300 C_R 3 0 0.51000 0 0
ATOM 393 CB THR 2 69 -3.33000 -7.97700 -7.72000 C_3 4 0 0.14000 0 0
ATOM 397 CG2 THR 2 69 -4.75300 -8.54400 -7.67200 C_3 4 0 -0.27000 0 0
ATOM 401 C THR 2 69 -2.58000 -9.55700 -5.85500 C_R 3 0 0.51000 0 0
ATOM 417 CB THR 2 71 1.99100 -9.86800 -2.77000 C_3 4 0 0.14000 0 0
ATOM 421 CG2 THR 2 71 2.86300 -10.15400 -1.55700 C_3 4 0 -0.27000 0 0
ATOM 425 C THR 2 71 -0.19100 -10.14200 -1.62900 C_R 3 0 0.51000 0 0
ATOM 492 CB CYS 2 77 -5.17100 -14.77100 4.04000 C_3 4 0 -0.11000 0 0
ATOM 495 SG CYS 2 77 -6.29600 -14.88500 2.59500 S_3 2 2 -0.23000 0 0
ATOM 497 C CYS 2 77 -4.65100 -13.75800 6.12000 C_R 3 0 0.51000 0 0
ATOM 2071 CB SER 7 316 -3.87300 -2.15900 1.02300 C_3 4 0 0.05000 0 0
ATOM 2076 C SER 7 316 -4.79700 -1.16500 -1.10800 C_R 3 0 0.51000 0 0
target.bgf
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
ATOM 578 CG ASP 2 72 -3.74900 -6.45900 -1.31600 C_R 3 0 0.62000 0 0
ATOM 581 C ASP 2 72 -3.19300 -9.62400 -0.87900 C_R 3 0 0.51000 0 0
I have two files of data. The first file contains data for the residues I want to calculate the distance to. The second file contains the coordinates of the target residue.
I want to calculate the minimum distance between the two (i.e. between ASP and the residues in temp.bgf). I haven't been able to come up with a good way to store the x,y,z values and compare the distances in temp.bgf.
There have been questions as to how the calculation should be done. Here is the idea I have:
@asp_atoms
@asn_atoms
$aspmin, $asnmin
foreach $ap (@asp_atoms)
{
foreach $an (@asn_atoms)
{
$dist = dist(...);
if ($dist < $min)
{
$min = $dist;
}
}
}
I hope that clarifies how I want to implement the code. However, the problem I am having is how to store the values in an array and traverse the file.
Also, to clarify exactly which numbers will be used for the distance, here is an example of what I want to do:
For the ASP CB atom with the coordinates -2.80100 -7.45000 -2.09400,
I want to calculate the distance to the ASN CB, ASN CG, and ASN C atoms; the minimum is the value printed out. Unfortunately, I don't have an exact value for what that minimum would be, but I have to print out values less than 5 units of distance. Then the distance from the ASP CG atom to all the ASN atoms would be calculated to find its min. So I am trying to find the minimum distance here.
You can solve this by splitting each row of your file on whitespace, storing the results in arrays of arrays, and then slicing out only the parameters you need (in this case x, y, z) in loops. This is not a complete answer to your problem, but it should give you an idea of how this can be accomplished.
open (my $temp,"<","temp.bgf");
open (my $target,"<","target.bgf");
my @temps = create_ar($temp);
my @targets = create_ar($target);
sub create_ar {
my $filehan = shift;
my @array;
foreach (<$filehan>) {
push @array,[split(/\s+/,$_)];
}
return @array;
}
foreach my $ap (@targets) {
my ($target_X,$target_Y,$target_Z) = @{$ap}[6,7,8];
foreach my $an (@temps) {
my ($temp_X,$temp_Y,$temp_Z) = @{$an}[6,7,8];
...
}
}
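If Perl is not a hard requirement, the same idea (slice out fields 7-9 as x,y,z, then take the pairwise minimum) fits in a short awk sketch. The coordinates below are synthetic, just to make the example self-contained:

```shell
# Two made-up atoms in temp.bgf and one in target.bgf; fields 7,8,9 are x,y,z.
printf 'ATOM 1 CB ASN 1 1 3 4 0 C_3 4 0 0 0 0\nATOM 2 CG ASN 1 1 10 0 0 C_3 4 0 0 0 0\n' > temp.bgf
printf 'ATOM 9 CB ASP 2 2 0 0 0 C_3 4 0 0 0 0\n' > target.bgf

awk 'NR==FNR { x[NR]=$7; y[NR]=$8; z[NR]=$9; n=NR; next }   # load temp atoms
     { min = -1
       for (i=1; i<=n; i++) {                               # scan all temp atoms
         dx=$7-x[i]; dy=$8-y[i]; dz=$9-z[i]
         d = sqrt(dx*dx + dy*dy + dz*dz)
         if (min < 0 || d < min) min = d
       }
       printf "%s %s %.3f\n", $3, $4, min                   # atom, residue, min dist
     }' temp.bgf target.bgf
# -> CB ASP 5.000   (from (0,0,0), the nearer atom is at (3,4,0))
```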