Sparse attention and its relation to the attention mask

Can anyone explain clearly how the mask is used in attention to implement sparse attention? I cannot see how masking tokens (I do not mean pad tokens here) can make attention faster, as claimed for example in the sparse attention definition in GMAT, under the section Global-Memory Augmented Transformers. Doesn't softmax still operate over the same sequence length? So how is it faster?
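To make the distinction concrete, here is a minimal NumPy sketch (the names are my own, not GMAT's code): a -inf mask changes the softmax values but not the amount of work, since the full n x n score matrix is still computed, while a genuinely sparse implementation skips the excluded positions entirely.

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    # The full n x n score matrix is computed whether entries are masked
    # or not, so masking alone saves no work. Each row of `mask` is
    # assumed to keep at least one position (True = attend).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def windowed_attention(Q, K, V, window):
    # A sparse pattern implemented directly: each query only touches its
    # local window, so the work is O(n * window) instead of O(n^2).
    n, d = Q.shape
    out = np.empty_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        s = Q[i] @ K[lo:hi].T / np.sqrt(d)
        w = np.exp(s - s.max())
        out[i] = (w / w.sum()) @ V[lo:hi]
    return out
```

The speedup only appears when the implementation exploits the pattern (blocked kernels in practice); a mask applied to a dense score matrix changes what the model attends to, not how much it computes.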

Related

How expressive can we be with arrays in Z3(Py)? An example

My first question is whether I can express the following formula in Z3Py:
Exists i::Integer s.t. (0<=i<|arr|) & (avg(arr)+t<arr[i])
This means: is there a position i, with 0<=i<|arr|, whose value arr[i] is greater than the average of the array, avg(arr), plus a given threshold t?
I know this kind of expression can be queried in Dafny, and (since Dafny uses Z3 underneath) I guess it can be done in Z3Py too.
My second question is: how expressive is the decidable fragment involving arrays in Z3?
I read this paper on how the full theory of arrays is not decidable (http://theory.stanford.edu/~arbrad/papers/arrays.pdf), and how only a certain fragment, the array property fragment, is.
Is there any interesting paper/tutorial on what can and cannot be done with arrays+quantifiers+functions in Z3?
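In the meantime, here is a minimal sketch of the fixed-length case, which I can express directly (names are mine; I multiply through by n so that avg needs no division):

```python
from z3 import Int, IntVector, Or, Solver, Sum, sat

n = 5                        # concrete length; symbolic length is the hard part
arr = IntVector('a', n)      # n symbolic integer elements
t = Int('t')                 # the threshold

s = Solver()
# arr[i] > avg(arr) + t  <=>  n * arr[i] > Sum(arr) + n * t  (no division)
s.add(Or([n * arr[i] > Sum(arr) + n * t for i in range(n)]))

if s.check() == sat:
    print(s.model())         # a witness array and threshold
```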
You found the best paper to read regarding reasoning with arrays, so I doubt there's a better resource or tutorial out there for you.
I think the sequence logic (not yet officially supported by SMTLib, but z3 supports it) is the right logic to use for reasoning about these sorts of problems; see: https://microsoft.github.io/z3guide/docs/theories/Sequences/
Having said that, most properties about arrays/sequences of "arbitrary size" require inductive proofs, because most interesting functions on them are essentially recursive (or iterative), and induction is the only way to prove properties of such programs. While SMT solvers have improved significantly in their support for recursive definitions and induction, they still perform nowhere near as well as a traditional theorem prover. (This is, of course, to be expected.)
I'd recommend looking at the sequence logic and playing around with recursive definitions. You might get some mileage out of that, though don't expect proofs for anything that requires induction, especially if the inductive hypothesis needs some clever invariant to be specified.
Note that if you know the length of your array concretely (i.e., 10, 15, or some other hopefully not too large number), then it's best to allocate the elements symbolically yourself and not use arrays/sequences at all. (And you can repeat your proof for lengths 0, 1, 2, ... up to some fixed number.) But if you want proofs that work for arbitrary lengths, your best bet is to use sequences in z3, not arrays, with all the caveats I mentioned above.
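For instance, here is a sketch of your first property with sequences (this assumes a reasonably recent z3py where seq[i] denotes the i-th element; total is a free stand-in for sum(arr), since defining the real sum needs a recursive function, which is exactly where induction comes in):

```python
from z3 import Const, Int, IntSort, Length, SeqSort, Solver

arr = Const('arr', SeqSort(IntSort()))   # a sequence of arbitrary length
i, t, total = Int('i'), Int('t'), Int('total')

s = Solver()
s.add(0 <= i, i < Length(arr))
# n * arr[i] > total + n * t  encodes  arr[i] > total/n + t  without division
s.add(Length(arr) * arr[i] > total + Length(arr) * t)
print(s.check())                         # a model gives a witness sequence/index
```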

Efficient encoding algorithm for sparse vector

Does anyone know how to compress (encode) a sparse vector?
By a sparse vector I mean a 1xN matrix that contains many zeros, for example:
10000000000001110000000000000000100000000
Of course, I know the run-length algorithm. I am looking for other algorithms for encoding this type of vector. Any help is appreciated.
Use modified Huffman coding, as in the source code on this site: https://www.programminglogic.com/implementing-huffman-coding-in-c/. This is used in fax machines, and it is common knowledge that modified Huffman is the first improvement over the simple run-length encoding you have already figured out. Note that it is usually described for fax/ASCII data, but you may implement it for any stream as long as you can determine which combinations of bits are more frequent than others.
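For reference, here is a minimal sketch of the run-length representation that modified Huffman then compresses (the fax standard replaces the plain integers below with fixed Huffman code tables keyed to typical run lengths):

```python
def run_lengths(bits: str) -> list[int]:
    """Run lengths of alternating symbols, starting from the first bit."""
    runs, count = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

v = "10000000000001110000000000000000100000000"
# The first symbol plus the run lengths reconstruct the vector exactly.
print(v[0], run_lengths(v))
```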

Alternate cutoff matrix

I have an alternate confusion matrix similar to the one above, and I am trying to predict bad risk.
I understand that with a .04 probability cutoff I am able to find more bad risks, but I would like to know what implications this .04 has in terms of the business. Why would I want a cutoff other than .5?
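To make the trade-off concrete, here is a sketch with purely hypothetical scores and labels: lowering the cutoff from .5 to .04 catches more bad risks, at the price of flagging more good customers. The business question is whether a missed bad risk (a default) costs more than a false alarm (a rejected good customer), which in credit scoring it usually does.

```python
# Hypothetical scores and labels, purely for illustration (1 = bad risk).
y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.90, 0.30, 0.05, 0.40, 0.20, 0.10, 0.06, 0.03, 0.02, 0.01]

def confusion(cutoff):
    # Count the four cells of the confusion matrix at a given cutoff.
    tp = sum(1 for s, y in zip(y_score, y_true) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(y_score, y_true) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(y_score, y_true) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(y_score, y_true) if s < cutoff and y == 0)
    return tp, fp, fn, tn

for cutoff in (0.5, 0.04):
    tp, fp, fn, tn = confusion(cutoff)
    print(f"cutoff={cutoff}: caught {tp}/{tp + fn} bad risks, {fp} false alarms")
```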

Toom-Cook multiplication algorithm implementation

I have a task to implement the Toom-Cook 3-way multiplication algorithm. I'm following the description on Wikipedia (http://en.wikipedia.org/wiki/Toom%E2%80%93Cook_multiplication), and I have managed to store two big numbers as strings and split the strings into smaller ones according to the "Splitting" step on the Wikipedia page. The next step is "Evaluation", and I have to calculate a new number p0 = m0 + m2 ("Faster evaluation" by Bodrato, found on the same page), where m0 and m2 are the parts I created by splitting the large number in the previous step. The problem is that I cannot simply add up m0 and m2, since those two numbers can still be very large and impossible to add together in the standard way. Does this mean that I have to implement my own algorithm for adding large numbers (as well as subtracting and dividing, since they are also needed), or am I missing something? If anyone could link me to a possible implementation or even pseudocode, it would be appreciated.
You have to implement your own methods for addition, subtraction, modulo, etc. Some time ago I was trying to implement a BigInteger library, and I found some resources that may be useful for you:
BigNum Math book (as pointed out in the previous answer)
Java OpenJDK BigInteger implementation, with documentation
Algorithms and Data Structures: The Basic Toolbox (I learned Karatsuba from this book)
By the way, I recommend using base 2 for your numbers (see here) because you can take advantage of the computer's binary nature to make your operations easier and faster.
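As a starting point, here is a minimal sketch of schoolbook addition on decimal digit strings, the kind of helper the evaluation step (m0 + m2, etc.) needs; subtraction follows the same digit-by-digit pattern with borrows instead of carries:

```python
def add_big(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal strings."""
    # Pad both to the same length, then walk from the least significant digit.
    a, b = a.zfill(len(b)), b.zfill(len(a))
    digits, carry = [], 0
    for da, db in zip(reversed(a), reversed(b)):
        carry, d = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return ''.join(reversed(digits))

print(add_big("987654321987654321", "123456789123456789"))
# -> 1111111111111111110
```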
LibTomMath is open source and includes a Toom-Cook multiplication; have a look.

What is the most practical board representation for a magic bitboard move generation system?

I am rewriting a chess engine I wrote so that it runs on magic bitboards. I have the magic functions written, and they take the square of the piece and an occupancy bitboard as parameters. What I am debating with myself is which of these board representation schemes is faster/more practical:
Scheme 1: There is a bitboard for each type of piece (one for white knights, one for black rooks, and so on), and in order to generate moves and push them to the move stack, I must serialize them to find the square of each piece and then call the magic function. Then I must serialize the resulting move bitboard and push the moves. The advantage is that the attacking and occupancy bitboards are close at hand.
Scheme 2: A simple piece-centric array ([2][16] or [32]) contains the square indices of the pieces. A simple loop through it, calling the functions, is all it takes to get the move bitboards. I then serialize those bitboards and push them to the move stack. I also have to maintain an occupancy bitboard. I guess getting an attack bitboard shouldn't be any different: I once again generate all the move bitboards and, instead of serializing them, I bitwise-OR them together in a mashup of magic.
I'm leaning towards scheme 2, but I think some implementation similar to scheme 1 is standard. For some reason I can't find drawbacks of making a "bitboard" engine without actually using bitboards. I wouldn't even be using bitboards for king and knight data, just a quick array lookup.
I guess my question is more about whether there is a better way to do this board representation, because I remember reading that keeping a bitboard for each type of piece is standard (maybe this is only necessary with rotated bitboards?). I'm relatively new to bitboard engines, but I've read a lot and I've implemented the magic method. I certainly like the piece-centric array approach - it makes a lot of arbitrary stuff like printing the board to the screen easier - but if there is a better/equal/more standard way, can someone please point it out? Thanks in advance - I know this is a fairly specific question and difficult to answer unless you are very familiar with chess programming.
Last-minute question: how does the speed of a lookup into a 2D array measure up against using a 1D array and adding 16 * team_side to the normal index to look up the piece?
edit: I thought I should add that I am valuing speed over almost all else in my chess implementation. Why else would I go with magic bitboards as opposed to simple arrays with slide data?
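For reference, the serialization loop both schemes depend on looks like this (a sketch in Python for brevity; a real engine would use a hardware bit-scan intrinsic instead of bit_length):

```python
def squares(bitboard: int):
    """Yield the square indices (0-63) of the set bits, LSB first."""
    while bitboard:
        lsb = bitboard & -bitboard        # isolate the lowest set bit
        yield lsb.bit_length() - 1        # its square index
        bitboard ^= lsb                   # clear it and continue

white_knights = (1 << 1) | (1 << 6)       # hypothetical knights on b1 and g1
print(list(squares(white_knights)))       # [1, 6]
```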
There is no standard answer to this, sorry.
The number and types of data structures you need depends on exactly what you want to do in your program. For example, having more than one representation of the pieces on the board makes some operations faster. On the other hand, it takes more time to update your data during each move.
To get the maximum speed, it is your job to find out what works best for your program. Does maintaining an extra array of pieces result in a net speedup for a program? It depends!
Sometimes it is a net gain to maintain a data structure, sometimes you can delay the calculations and cache the result, and sometimes you just calculate it when needed (and hope it isn't needed very often).
