Naming an array from an array in GAWK

I have a file with repeating elements. I would like to assign records to an array until the file repeats, at which point I want to create a new array to assign the records to. I would like to do this an arbitrary number of times.
For example:
$ cat repeat.txt
a
b
c
d
e
f
g
a
b
c
d
e
f
g
a
b
c
d
e
f
g
I want the output to be something like this:
0 a a a
1 b b b
2 c c c
3 d d d
4 e e e
5 f f f
6 g g g
Right now I am doing this with this hideous code:
awk 'BEGIN{n=0;z=0}
$1~"a" {n=0;z++}
z==1{a[n]=$0}
z==2{b[n]=$0}
z==3{c[n]=$0}
z==4{d[n]=$0}
z==5{e[n]=$0}
z==6{f[n]=$0}
{n++}
END{for (i in a)
print i,a[i],b[i],c[i],d[i],e[i],f[i],g[i],h[i],k[i],j[i]}' repeat.txt
I would like the assignment of new arrays to be automatic.
I attempted this with the following:
echo "abcdefghijklmopqrstuvwxyz" > alphabet.txt
awk 'BEGIN{N=0}
NR==FNR{FS=""}
NR==FNR{for (zz=0;zz<=NF;zz++) a[zz]=$zz; next}
NR!=FNR{FS="\t"}
NR!=FNR{if ($0~a) N++; (a[N])[N]=$0}
END{for (I in (a[N])) print I,(a[N])[I]}' alphabet.txt repeat.txt
But this didn't work, because gawk doesn't accept multidimensional array syntax like this. I can't think of another way to do it.
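The grouping described in the question (start a new column whenever the first record reappears) does not need one named array per repetition; a list of columns is enough. The sketch below is in Python rather than gawk, the names `columns_by_repeat` and `format_rows` are made up for illustration, and it assumes the first line of the file marks the start of every repetition.

```python
def columns_by_repeat(lines):
    """Split records into columns, opening a new column each time the first record recurs."""
    columns = []
    first = None
    for line in lines:
        record = line.rstrip("\n")
        if first is None:
            first = record           # remember what marks the start of a block
        if record == first:
            columns.append([])       # the sequence repeats: open a new column
        columns[-1].append(record)
    return columns


def format_rows(columns):
    """Return lines like '0 a a a', one per position within the repeating block."""
    height = max((len(c) for c in columns), default=0)
    rows = []
    for i in range(height):
        cells = [c[i] for c in columns if i < len(c)]
        rows.append(" ".join([str(i)] + cells))
    return rows


if __name__ == "__main__":
    sample = list("abcdefg") * 3     # same content as repeat.txt
    print("\n".join(format_rows(columns_by_repeat(sample))))
```

The number of columns grows with the input, so nothing has to be hard-coded per repetition.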

Related

Common Lisp: Why does lparallel have problems with assigning array elements?

I've written a function that copies many elements from one array to another. I wanted to speed it up using the (pdotimes) function from lparallel. The code looks like this:
(pdotimes (i (size output))
  (setf (row-major-aref output i)
        (row-major-aref input (dostuff i))))
The (dostuff) function does arithmetic on the row-major output index i to convert it to the row-major input index. When I run this function, the results tend to look like this:
#2A((9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 0 9 9 9 9 9 9 9 9 5 5 0 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 0 0 9 9 5 5 5 5 5 5 5 5 5 5)
(9 9 9 9 9 9 9 9 9 9 5 5 5 5 5 5 5 5 5 5))
The function is supposed to catenate a matrix of 9s on the left and a matrix of 5s on the right. But notice that there are a few 0s in there too. Zeroes are the initial value for the output matrix, so that means that those elements didn't get assigned.
The non-assignment of elements is seemingly random; run the function many times and zeroes will appear in different places. For some reason, those elements are being missed.
I've tried wrapping the function in a future, like this:
(let ((f (future (pdotimes ...))))
  (force f))
But that doesn't work either. One thing I've noticed is that the larger the number of threads and the smaller the size of the array, the more elements get missed. It suggests that the array element assignments are clobbering each other somehow.
I've also tried using (pmap-into) to map the function's results into a vector that's displaced to the output, but that fails in a different way: instead of 0s showing up where elements weren't assigned, elements get assigned in the wrong places. If the array contains repeating "1 2 3 4" sub-vectors, sometimes a "1 2 2" sequence will appear, for example.
AFAIK it should be possible for threads to concurrently assign different elements in the same array, but does Common Lisp have problems with this? Do I need to implement a lock so assignments are guaranteed to happen synchronously? If simultaneous assignments were a problem, I'd expect to see more unassigned elements. Any help appreciated.
Edit: I seem to have found how to prevent this, but not the root cause. Try running this in SBCL:
(let ((output (make-array '(20 20) :initial-element 0 :element-type '(unsigned-byte 7))))
  (check-type output simple-array)
  (pdotimes (i (array-total-size output) output)
    (setf (row-major-aref output i)
          (random-elt '(1 2 3 4 5 6)))))
No zeroes will appear in the output. Now try this in SBCL:
(let ((output (make-array '(20 20) :initial-element 0 :element-type '(unsigned-byte 4))))
  (check-type output simple-array)
  (pdotimes (i (array-total-size output) output)
    (setf (row-major-aref output i)
          (random-elt '(1 2 3 4 5 6)))))
And see zeroes aplenty. I just tested this with CCL and the output was fine. I'm going to try some other CLs but it seems like this is an SBCL problem so far. For some reason, SBCL has problems doing concurrent assignments to arrays with elements smaller than 7 bits. Character arrays are fine, as are floats and t-type arrays.
This is a slightly speculative answer, but I'm reasonably sure it's correct.
If an implementation supports arrays whose element size (in bits) is smaller than the smallest object the machine can read from and write to memory, and if it stores those arrays without wasted space (which is, really, the only purpose of having them), then the only approach to updating an array element is:
read smallest object containing element from memory;
update object with element;
write back.
Since writes to different array elements can result in reading and writing the same smallest object from memory, this is not safe in the presence of multiple threads without interlocking, which would generally have catastrophic performance effects.
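The read/update/write sequence can be made concrete with a deterministic simulation. The sketch below is plain Python, not SBCL's actual implementation: it assumes two 4-bit elements packed per byte (low nibble for even indices) and interleaves two writers by hand to show how one update is lost.

```python
def read_byte(storage, index):
    """Step 1: read the byte that holds nibble `index`."""
    return storage[index // 2]


def update_byte(byte, index, value):
    """Step 2: replace one nibble inside a copy of the byte."""
    if index % 2 == 0:
        return (byte & 0xF0) | (value & 0x0F)        # even index: low nibble
    return (byte & 0x0F) | ((value & 0x0F) << 4)     # odd index: high nibble


def write_byte(storage, index, byte):
    """Step 3: write the whole byte back."""
    storage[index // 2] = byte


def get_nibble(storage, index):
    byte = storage[index // 2]
    return byte & 0x0F if index % 2 == 0 else byte >> 4


def lost_update_demo():
    storage = bytearray(1)                 # elements 0 and 1 share this byte
    # Both "threads" read before either of them writes.
    seen_by_a = read_byte(storage, 0)
    seen_by_b = read_byte(storage, 1)
    # A stores 9 at index 0.
    write_byte(storage, 0, update_byte(seen_by_a, 0, 9))
    # B stores 5 at index 1 from its stale copy, wiping out A's nibble.
    write_byte(storage, 1, update_byte(seen_by_b, 1, 5))
    return get_nibble(storage, 0), get_nibble(storage, 1)
```

With this interleaving, element 0 ends up back at its initial 0 even though it was assigned, which matches the stray zeroes in the question.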
Probably all CL implementations have such arrays for modern machines which can't write single bits to memory, in the form of bit arrays. SBCL also has arrays of element types with 2 and 4 bits, which, assuming machines can read and write no object smaller than 8 bits, are also in this area. It's also possible that arrays with very large object types could suffer from the same problem, if multiple reads and writes are required to load and store an object.
It should be possible to look at the disassembly of code that uses such arrays to see the behaviour. It's probably also the case that such arrays have lower performance than ones with larger element types (experimentally this is true for SBCL on x64: code which initialises an (unsigned-byte 4) array is 2.5 times slower than that which initialises an (unsigned-byte 8) array).
As a note, I suspect strongly the right approach to getting good performance out of array-bashing code is to partition the arrays amongst the cores in a fairly smart way.
That being said, here's a way to initialize an array of nibbles ((unsigned-byte 4)s) which I think should be safe on the assumption that the smallest object that can be written atomically is a byte. The trick is to write pairs of even-odd addresses at once:
(defun initialize-nibble-array (a)
  ;; the idea is to put some pattern in it I can see if it has holes
  (declare (type (array (unsigned-byte 4) *) a))
  (let ((s (array-total-size a)))
    (pdotimes (i (truncate s 2))
      (let ((rmi (* i 2)))
        (setf (row-major-aref a rmi) (mod rmi 8)
              (row-major-aref a (1+ rmi)) (mod (1+ rmi) 8))))
    (when (oddp s)
      ;; if the array has an odd number of elements we've missed one
      ;; at the end
      (setf (row-major-aref a (- s 1)) (mod (- s 1) 8)))
    a))
I wrote a minimal example as follows (it uses lparallel and alexandria):
(let ((output (make-array '(20 20) :initial-element '_)))
  (check-type output simple-array)
  (pdotimes (i (array-total-size output) output)
    (setf (row-major-aref output i)
          (random-elt '(a b c d e f g h)))))
And it consistently fills the output grid as follows, each time:
#2A((B G C D H A F E D C F D F G D F A C G G)
(C E D D F A H A F D G E G A C C F G E G)
(H C A E C F E H E D F G D B H B B A H D)
(D H G H H A E B G D E G D E G C E A B B)
(B E H G E E C D A H F A E C F D D A H H)
(C B D D G D H H D G H C A A H G B G C C)
(H H D D C F D B H B H G B C F G H F D E)
(F B C C A H D H G H C D G G D F E G A B)
(A E G C C H F C F C E F H H D E C H H D)
(H G H C D F G E D E C E A H C E A H H H)
(E C B E E C A D B G A F C B G A D G F D)
(H D D H A E A A G D H B H D A G A G C F)
(C D F H D G A D E C F C C D F A F F C H)
(H H D E C B C B E B B G G H H B A A E H)
(G F C C B F C D D D H F A B C F F C A B)
(D A H B B F H B B B F F H B G B H C F E)
(A G H C D H A H C H B F D D A G A E B G)
(G H A D H G B E A A B F C E G G G D E D)
(C E G F H F A A A H D D F B F C H B G B)
(H E H D D F F H E G G A A E D G C H H B))
But section 3.6, Traversal Rules and Side Effects, says that the consequences are undefined if you modify a fill-pointer (impossible for non-vectors) or adjust the array (?). Your example does not look like the array is being adjusted, though.
Sorry for the question but does it work with dotimes? Does my example work on your machine?

Join four columns into one according to each row

A B C D
E F G H
I J K L
M N O P
If I chose to join the columns I would use ={A1:A;B1:B;C1:C;D1:D}, but it would look like this:
A
E
I
M
B
F
J
N
... and so on
I would like it to look like this:
A
B
C
D
E
F
G
... and so on
How to proceed in this case?
Note: Some of the columns may be incomplete, and some may have more values than others, but I still want to follow this same pattern. Example:
A B D
E G H
I J K L
M N O P
Result:
A
B
D
E
G
H
... and so on
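Use this to join each row into one space-separated cell:
=TRANSPOSE(QUERY(TRANSPOSE(A:D),, 9^9))
Then wrap it once more to split everything into a single column (this relies on the values containing no spaces):
=TRANSPOSE(SPLIT(QUERY(TRANSPOSE(QUERY(TRANSPOSE(A:D),,9^9)),,9^9), " "))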
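For comparison, the target ordering is a row-by-row flatten that skips empty cells. The following is a Python sketch, not a Sheets formula; `flatten_rows` is a hypothetical helper, the grid is assumed to be a list of rows with empty strings for blank cells, and the positions of the blanks in the sample are guessed from the question.

```python
def flatten_rows(grid):
    """Return the non-empty cells in row-major order (left to right, then top to bottom)."""
    return [cell for row in grid for cell in row if cell != ""]


# Sample data modelled on the question; blank positions are an assumption.
grid = [
    ["A", "B", "", "D"],
    ["E", "", "G", "H"],
    ["I", "J", "K", "L"],
    ["M", "N", "O", "P"],
]
print(flatten_rows(grid)[:6])
```

Unlike the space-splitting formula, this keeps cells that contain spaces intact.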

Batch insert heading/newline to ASCII file if value of column changes

I have a file similar to this:
A B C
D E C
F G C
A B X
F G X
A B Q
D E Q
This is what I am looking for:
> C
A B C
D E C
F G C
> X
A B X
F G X
> Q
A B Q
D E Q
So far I have a rather complicated workaround.
Using AWK to add an empty line:
awk -v i=3 "NR>0 && $i!=p { print "A" }{ p=$i } 1" file.txt
I can't manage to add a ">" directly with awk since it's a newline value. Instead of the "A", awk outputs an empty line. Not really sure why.
Then, using
sed -e "s/^$/>/" file.txt
I manage to insert a ">" on the empty line, but the heading after it is still missing.
sed is for doing s/old/new, that is all. What you are attempting to do is not just s/old/new so you shouldn't be considering using sed, just use awk:
$ awk '$3!=p{print ">", $3; p=$3} 1' file
> C
A B C
D E C
F G C
> X
A B X
F G X
> Q
A B Q
D E Q
An awk solution, assuming that your input file is sorted so that equal values in the last column are adjacent:
awk '!a[$NF]++{ print ">",$NF }1' file
The output:
> C
A B C
D E C
F G C
> X
A B X
F G X
> Q
A B Q
D E Q
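The two one-liners above differ when the input is not grouped: comparing against the previous value prints a header at every change, whereas the seen-set version prints a header only the first time a value appears. A Python sketch of both behaviours follows; the function names are illustrative, and lines are assumed to be non-empty with whitespace-separated fields and the key in the last column.

```python
def headers_on_change(lines):
    """Mirrors awk '$3!=p{print ">", $3; p=$3} 1': header whenever the key changes."""
    out, prev = [], None
    for line in lines:
        key = line.split()[-1]
        if key != prev:
            out.append("> " + key)
            prev = key
        out.append(line)
    return out


def headers_on_first_seen(lines):
    """Mirrors awk '!a[$NF]++{ print ">",$NF }1': header only on first occurrence."""
    out, seen = [], set()
    for line in lines:
        key = line.split()[-1]
        if key not in seen:
            seen.add(key)
            out.append("> " + key)
        out.append(line)
    return out


unsorted = ["A B C", "A B X", "D E C"]
print(headers_on_change(unsorted))      # "> C" appears twice
print(headers_on_first_seen(unsorted))  # "> C" appears once
```

On grouped input such as the file in the question, both produce the same output.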
Could you please try the following as well and let me know if it helps?
awk 'NR==1{print ">",$3 RS $0;prev=$3;next} prev!=$3{print ">",$3};1; {prev=$3}' Input_file
Output will be as follows.
> C
A B C
D E C
F G C
> X
A B X
F G X
> Q
A B Q
D E Q

Replace corresponding parts of one array with another array in R

I have an array/named vector that looks like this:
d f g
1 2 3
I want to fill up the empty slots, meaning I want this:
a b c d e f g
0 0 0 1 0 2 3
Is there an elegant way of doing this, without having to write loops and conditionals? In my actual problem, instead of abcd as my array names, it's numbers. Not sure if that makes a difference. Figured alphabet is easier to understand for a reproducible example.
1) Create a vector of the final names, nms, then create a named vector of zeros from it using sapply, and replace the elements corresponding to the input names with the input values.
v <- c(d = 1, f = 2, g = 3) # input
nms <- letters[letters <= max(names(v))] # names on output vector, i.e. letters[1:7]
replace(sapply(nms, function(x) 0), names(v), v) ##
giving:
a b c d e f g
0 0 0 1 0 2 3
If in your actual vector the names are not letters then just set nms yourself. For example, nms <- c("dogs", "cats", "d", "elephants", "f", "g") would work with the same line marked ## above.
2) An alternative is to replace the line marked ## above with:
unlist(modifyList(as.list(setNames(numeric(length(nms)), nms)), as.list(v)))
Data
x <- c(d=1L,f=2L,g=3L);
x;
## d f g
## 1 2 3
Solution 1: First match new names into x and extract values, then replace NAs with zero.
x <- setNames(x[match(letters[1:7],names(x))],letters[1:7]);
x[is.na(x)] <- 0L;
x;
## a b c d e f g
## 0 0 0 1 0 2 3
Solution 2: One-liner, using nomatch argument of match().
setNames(c(x,0L)[match(letters[1:7],names(x),nomatch=length(x)+1L)],letters[1:7]);
## a b c d e f g
## 0 0 0 1 0 2 3

Cropping a .ppm file in C

I'm working on a C program that crops .ppm files from a starting pixel (x,y) (top left corner of the cropped image) to an end pixel (x+w,y+h) (bottom right corner of the cropped image).
The data in .ppm files is of the following format:
r g b r g b r g b r g b r g b r g b
r g b r g b r g b r g b r g b r g b
r g b r g b r g b r g b r g b r g b
r g b r g b r g b r g b r g b r g b
Is there a simple way, which avoids the use of 2-dimensional arrays, to do this using scanf()?
One easy way would be to simply keep track of your pixel coordinate as you read the file in. If you're currently in the crop rectangle, store the pixel; otherwise, skip it.
If you want to get more fancy: figure out the byte offset for the start of each row, seek to it, then read in the whole row.
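The coordinate-tracking approach can be sketched as follows. This is Python rather than C; it assumes a plain-text (P3) file with no comment lines and non-negative x and y, and `crop_ppm_text` is a made-up name. The crop rectangle is taken as top-left (x, y) with width w and height h, clipped to the image.

```python
def crop_ppm_text(text, x, y, w, h):
    """Crop a plain-text (P3) PPM given as a string and return a new P3 string."""
    tokens = text.split()
    if tokens[0] != "P3":
        raise ValueError("only plain-text P3 input is handled in this sketch")
    width, height, maxval = (int(t) for t in tokens[1:4])
    samples = tokens[4:]                      # r g b r g b ...

    # Clip the rectangle so it never extends past the image.
    out_w = max(0, min(x + w, width) - x)
    out_h = max(0, min(y + h, height) - y)

    kept = []
    for pixel in range(width * height):
        row, col = divmod(pixel, width)       # coordinate of the current pixel
        if y <= row < y + out_h and x <= col < x + out_w:
            kept.extend(samples[3 * pixel:3 * pixel + 3])   # inside: store it
        # outside the rectangle: skip it

    lines = ["P3", f"{out_w} {out_h}", str(maxval)]
    for r in range(out_h):
        lines.append(" ".join(kept[3 * out_w * r:3 * out_w * (r + 1)]))
    return "\n".join(lines) + "\n"
```

Only a flat list of kept samples is stored, so no 2-dimensional array is needed; in C the same loop could read each sample with scanf("%d", ...) and write it out immediately.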
Warning: some PNM files are in binary mode (for PPM, magic number P6 instead of the plain-text P3); the magic number at the beginning of the file tells you which.
Maybe looking up the sources of pnmcrop would help?
