I need to use the modulo operation inside a kernel, and it is slowing things down. It is impossible for me to remove it. Basically, I have a%b where b is not a power of 2. Is there any way to avoid using it?
Can you prefetch the answers and use a lookup table?
Instead of
c = a%b;
you could then try
c = table[a][b];
Some consideration has to be given to the table's signature and size.
Depending on the overall use case, you could move this table to a higher level and remove more than just this single computation.
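For instance, a minimal sketch in C, assuming both a and b stay below 256 (the bounds and the entry type are assumptions of mine; size the table to your actual ranges):
#include <stdint.h>

#define MAX_A 256
#define MAX_B 256

static uint8_t table[MAX_A][MAX_B];

/* Fill the table once, outside the hot path. */
void init_mod_table(void)
{
    for (int a = 0; a < MAX_A; a++)
        for (int b = 1; b < MAX_B; b++) /* b == 0 is undefined, skip it */
            table[a][b] = (uint8_t)(a % b);
}

/* In the kernel, the divide becomes a load: c = table[a][b]; */
Note the trade-off: the table costs MAX_A * MAX_B bytes, so the lookup only wins if the table stays resident in cache.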
A custom implementation of modulo would start from its definition:
(a/b)*b + a%b == a; // true
a%b == a - (a/b)*b; // true
Depending on the likely values of a and b, you could try to optimize this.
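Since a/b is usually just as slow as a%b, one common refinement when b is fixed over many iterations is to replace the divide with a multiply by a precomputed fixed-point reciprocal. A sketch of mine, not from the answer above, assuming 32-bit unsigned a and 2 <= b < 2^31:
#include <stdint.h>

/* Precompute once per b: uint64_t inv = ((uint64_t)1 << 32) / b; */
uint32_t mod_fixed(uint32_t a, uint32_t b, uint64_t inv)
{
    uint32_t q = (uint32_t)(((uint64_t)a * inv) >> 32); /* q == a/b or a/b - 1 */
    uint32_t r = a - q * b;                             /* so r is in [0, 2b) */
    return (r >= b) ? r - b : r;                        /* one conditional fix-up */
}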
Depending on your target hardware, you could check whether there is a fast hardware solution that can handle this for your specific case (see this).
There may be more solutions out there.
From the basics of the C language we know that in the following code:
y = fn3(fn2(fn1(x)));
...fn1() is executed first, fn2() is executed second, and fn3() is executed last.
What order of matrix transformations is built by the following C code?
ctm = fz_pre_translate(fz_pre_rotate(fz_scale(sx, sy), r), tx, ty);
Case A or Case B?
The documentation of the MuPDF library API is available at this link, and it states the following on page 61:
Alternatively, operations can be specifically applied to existing matrices. Because of the non-commutative nature of matrix operations, it matters whether the new operation is applied before or after the existing matrix. For example, if you have a matrix that performs a rotation, and you wish to combine that with a translation, you must decide whether you want the translation to occur before the rotation (‘pre’) or afterwards (‘post’).
MuPDF has various API functions for such operations (such as the fz_pre_* and fz_post_* families).
To me, the statement above suggests that the order of transformations built by these functions is not the same as the order of nested function evaluations in C (and their invocations), but I just can't be sure.
In mathematical terms, case A (a translation, followed by a rotation, followed by scaling) could be expressed as
x' = S · (R · (T · x)) = S · R · T · x = ((S · R) · T) · x
So we want to apply the translation before the other transformations combined, while the scaling should only be applied afterwards. The combined matrix can be grouped either way:
M = S · R · T = M_{S,R} · T = S · M_{R,T}
(where M_{S,R} = S · R and M_{R,T} = R · T)
"To me the statement above suggests that the order of transformations, being built by these functions, is not the same as the order of nested function evaluations in C."
I'd say that a richer API, like the one exposed by this library, lets the user choose between different ways of expressing their intent, but it doesn't (of course) violate the rules of C nested function calls.
There could be cases in which particular algorithms may benefit from one approach or the other.
ctm = fz_pre_translate(fz_pre_rotate(fz_scale(sx, sy), r), tx, ty);
Note that, despite the order in which the nested functions are called, this line can be read exactly as the first statement of this answer (first translation, then rotation, then scale), while the mathematical notation is basically backwards. The same matrix can also be built with the 'post' variants, reading in the opposite direction:
ctm = fz_post_scale(fz_post_rotate(fz_translate(tx, ty), r), sx, sy);
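To see concretely that both lines build the same matrix, here is a small self-contained sketch of mine that mimics the MuPDF naming (it is an illustration, not the library's code). It uses the row-vector convention p' = p · M, so concat(one, two) applies one first:
#include <math.h>
#include <stdio.h>

typedef struct { double a, b, c, d, e, f; } mat; /* rows: (a b 0), (c d 0), (e f 1) */

/* concat(m1, m2) = m1 · m2: transform by m1 first, then by m2. */
static mat concat(mat m1, mat m2)
{
    mat m;
    m.a = m1.a * m2.a + m1.b * m2.c;
    m.b = m1.a * m2.b + m1.b * m2.d;
    m.c = m1.c * m2.a + m1.d * m2.c;
    m.d = m1.c * m2.b + m1.d * m2.d;
    m.e = m1.e * m2.a + m1.f * m2.c + m2.e;
    m.f = m1.e * m2.b + m1.f * m2.d + m2.f;
    return m;
}

static mat scale(double sx, double sy)     { return (mat){ sx, 0, 0, sy, 0, 0 }; }
static mat translate(double tx, double ty) { return (mat){ 1, 0, 0, 1, tx, ty }; }
static mat rotate(double r)                { return (mat){ cos(r), sin(r), -sin(r), cos(r), 0, 0 }; }

/* "pre": the new operation is applied before the existing matrix m... */
static mat pre_rotate(mat m, double r)                { return concat(rotate(r), m); }
static mat pre_translate(mat m, double tx, double ty) { return concat(translate(tx, ty), m); }

/* ..."post": the new operation is applied after it. */
static mat post_rotate(mat m, double r)            { return concat(m, rotate(r)); }
static mat post_scale(mat m, double sx, double sy) { return concat(m, scale(sx, sy)); }

int main(void)
{
    double sx = 2, sy = 3, r = 0.5, tx = 5, ty = 7;
    mat m1 = pre_translate(pre_rotate(scale(sx, sy), r), tx, ty);
    mat m2 = post_scale(post_rotate(translate(tx, ty), r), sx, sy);
    printf("m1 = [%g %g %g %g %g %g]\n", m1.a, m1.b, m1.c, m1.d, m1.e, m1.f);
    printf("m2 = [%g %g %g %g %g %g]\n", m2.a, m2.b, m2.c, m2.d, m2.e, m2.f);
    return 0; /* both lines print the same matrix */
}
The pre chain builds concat(T, concat(R, S)) and the post chain builds concat(concat(T, R), S); they agree by associativity, which is exactly the point made above.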
Whether this generates confusion in the reader or not is, I'm afraid, a matter of opinion and personal background.
To my knowledge, though, having as small a public API as possible is considered less error-prone and easier to maintain.
A common problem I encounter when I want to write concise/readable code:
I want to update all values of a vector that match a logical expression, with a new value that depends on the previous value.
For example, double all even entries:
weights = [10 7 4 8 3];
weights(mod(weights,2)==0) = weights(mod(weights,2)==0) * 2;
% weights = [20 7 8 16 3]
Is it possible to write the second line in a more concise fashion (i.e. avoiding the double use of the logical expression, something like i += 3 for i = i + 3 in other languages)? If I often use this kind of vector operation in different contexts/variables, and I have long conditionals, I feel that my code is less concise and readable than it could be.
Thanks!
How about
ind = mod(weights,2)==0;
weights(ind) = weights(ind)*2;
This way you avoid calculating the indices twice and it's easy to read.
Regarding your other comment to Wauzl: such powerful operation capabilities come from the Fortran side. This is purely MATLAB's design, and it is quickly becoming obsolete. Let's take this horribleness further:
for i=1:length(weights), if mod(weights(i),2)==0, weights(i)=weights(i)*2; end, end
It is even slightly faster than your two-liner, because there you are doing the conditional indexing twice, once on each side. In general, consider switching to Python3.
Well, after more searching around, I found this link that deals with this issue (I used search before posting, I swear!), and there is interesting further discussion of this topic in the links in that thread. So apparently there are ambiguity issues with introducing such an operator.
Looks like that is the price we have to pay in terms of syntactic limitations for having such powerful matrix operation capabilities.
Thanks a lot anyway, Wauzl!
Consider as an example the matrix
X(a,b) = [ a  b
           a  a ]
I would like to perform some relatively intensive matrix algebra computations with X, update the values of a and b and then repeat.
I can see two ways of storing the entries of X (a sketch in C follows the list):
1) As numbers (i.e. floats). Then after our matrix algebra operation we update all the values in X to the correct values of a and b.
2) As pointers to a and b, so that after updating them, the entries of X are automatically updated.
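A minimal sketch of the two options; the type and function names are illustrative, not from the question:
/* (1) By value: entries are plain numbers and must be refreshed
 *     whenever a or b changes. Contiguous storage, cache friendly. */
typedef struct { double m[2][2]; } MatByValue;

void refresh(MatByValue *X, double a, double b)
{
    X->m[0][0] = a; X->m[0][1] = b;
    X->m[1][0] = a; X->m[1][1] = a;
}

/* (2) By pointer: entries alias a and b, so updating the scalars
 *     updates X automatically, at the cost of an extra indirection
 *     (and worse cache locality) on every read. */
typedef struct { const double *m[2][2]; } MatByPointer;

void bind(MatByPointer *X, const double *a, const double *b)
{
    X->m[0][0] = a; X->m[0][1] = b;
    X->m[1][0] = a; X->m[1][1] = a;
}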
Now, I initially thought method (2) was the way to go, as it skips the updating step. However, I believe that method (1) allows better use of the cache when doing, for example, matrix multiplication in parallel (although I am no expert, so please correct me if I'm wrong).
My hypothesis is that for inexpensive matrix computations you should use method (2), and that there is some threshold, as the computation becomes more complex, beyond which you should switch to (1).
I imagine this is not too uncommon a problem and my question is which is the optimal method to use for general matrices X?
Neither approach sounds particularly hard to implement. The simplest answer is to make a test calculation, try both, and benchmark them. Take the faster one. Depending on what sort of operations you're doing (matrix multiplication, inversion, etc.?), you can potentially reduce the computation by simplifying the operations, given the assumptions you can make about your matrix structure. But I can't speak to that any more deeply, since I'm not sure what types of operations you're doing.
But from experience, with a matrix that size, you probably won't see a performance difference. With larger matrices you will, since the CPU's cache starts to fill. In that case, doing things like separating multiplication and addition operations, using pointer indexing, and passing inputs as const enables the compiler to make significant performance enhancements.
See Optimized matrix multiplication in C and Cache friendly method to multiply two matrices.
Reading a couple of questions about post-increment and pre-increment, I need to explain to a new programmer in which cases one would actually need one or the other: in what type of scenario one would apply a post-increment operator, and in what type it is better to apply a pre-increment one.
This is to teach case studies where, in a particular piece of code, one would need to apply one or the other in order to obtain specific values for certain tasks.
The short answer is: You never need them!
The long answer is that the instruction sets of early micro-computers had features like that: upon reading a memory cell, you could post-increment or pre-decrement it. Such machine-level features inspired the predecessors of C, from whence they found their way even into more recent languages.
To understand this, one must remember that RAM was extremely scarce in those days. When you have 64k addressable RAM for your program, you'll find it worth it to write very compact code. The machine architectures of those days reflected this need by providing extremely powerful instructions. Hence you could express code like:
s = s + a[j]
j = j + 1
with just one instruction, given that s and j were in registers.
Thus we have language features in C that allowed the compiler, without much effort, to generate efficient code like:
register int s = 0; // clr r5
s += a[j++];        // mov j+, r6 ; move j to r6 and increment j afterwards
                    // add r5, a[r6]
The same goes for the short-cut operations like +=, -=, *= etc. They are:
- a way to save typing
- a help for a compiler that had to fit in a small RAM and couldn't afford much optimization
For example,
a[i] *= 5
which is short for
a[i] = a[i] * 5
in effect saves the compiler some form of common subexpression analysis.
And yet, all these language features can always be replaced by equivalent, maybe slightly longer, code that doesn't use them. Modern compilers should translate them to efficient code, just like the shorter forms.
So the bottom line, and the answer to your question: you don't need to look for cases where one needs to apply those operators. Such cases simply do not exist.
Well, some people, like Douglas Crockford, are against using those operators, because they can lead to unexpected behavior by hiding the final result from the untrained eye.
But, since I'm using JSHint, let's share an example here:
http://jsfiddle.net/coma/EB72c/
var List = function(values) {
    this.index = 0;
    this.values = values;
};

List.prototype.next = function() {
    return this.values[++this.index]; // pre-increment: advance first, then read
};

List.prototype.prev = function() {
    return this.values[--this.index]; // pre-decrement: step back first, then read
};

List.prototype.current = function() {
    return this.values[this.index];
};

List.prototype.prefix = function(prefixes) {
    var i;
    for (i = 0; i < this.values.length; i++) {
        this.values[i] = prefixes[i] + this.values[i];
    }
};
var animals = new List(['dog', 'cat', 'parrot']);
console.log('current', animals.current()); // current dog
console.log('next', animals.next());       // next cat
console.log('next', animals.next());       // next parrot
console.log('current', animals.current()); // current parrot
console.log('prev', animals.prev());       // prev cat
console.log('prev', animals.prev());       // prev dog
animals.prefix(['Snoopy the ', 'Garfield the ', 'A ']);
console.log('current', animals.current()); // current Snoopy the dog
As others have said, you never "need" either flavor of the ++ and -- operators. Then again, there are a lot of language features that you never "need" but that are useful for clearly expressing the intent of the code. You don't need an assignment that returns a value either, or unary negation, since you can always write (0-x)... heck, if you push that to the limit, you don't need C at all, since you can always write in assembler, and you don't need assembler, since you can always just set the bits to construct the instructions by hand...
(Cue Frank Hayes' song, When I Was A Boy. And get off my lawn, you kids!)
So the real answer here is to use the increment and decrement operators where they make sense stylistically -- where the value being manipulated is in some sense a counter and where it makes sense to advance the count "in passing". Loop control is one obvious place where people expect to see increment/decrement and where it reads more clearly than the alternatives. And there are many C idioms which have almost become meta-statements in the language as it is actually used -- the classic one-liner version of strcpy(), for example -- which an experienced C programmer will recognize at a glance and be able to recreate when needed; many of those take advantage of increment/decrement as a side effect.
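For reference, that classic one-liner looks like this (renamed here so it doesn't clash with the library function):
void my_strcpy(char *dst, const char *src)
{
    while ((*dst++ = *src++)) /* copy a char, advance both pointers; stop after '\0' */
        ;
}
Both pointers are advanced "in passing", and the copied character doubles as the loop condition.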
Unfortunately, "where it makes sense stylistically" is not a simple rule to teach. As with any other aspect of coding style, it really needs to come from exposure to other folks' code and from an understanding of how "native speakers" of the programming language think about the code.
That isn't a very satisfying answer, I know. But I don't think a better one exists.
Usually, it just depends on what you need in your specific case. If you need the result of the operation, it's just a question of whether you need the value of the variable from before or after the increment/decrement. So use the one that makes the code clearer to understand.
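A minimal illustration of the difference, in C:
int i = 5;
int a = i++; /* post: a == 5, i == 6 (value taken before the increment) */
int j = 5;
int b = ++j; /* pre:  b == 6, j == 6 (value taken after the increment) */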
In some languages, like C++, it is considered good practice to use the pre-increment/pre-decrement operators if you don't need the previous value, since the post operators have to store that value in a temporary during the operation; they therefore require more instructions (an additional copy) and can cause performance issues for complex types.
I'm no C expert, but I don't think it really matters which one you use in C, since there are no increment/decrement operators for large structs.
I have a vector A, represented by an angle and a length. I want to add vector B, updating the original A. B comes from a lookup table, so it can be represented in whichever way makes the computation easier.
Specifically, A is defined thusly:
uint16_t A_angle; // 0-65535 = 0-2π
int16_t A_length;
Approximations are fine. Checking for overflow is not necessary. A fast sin/cos approximation is available.
The fastest way I can think of is to have B represented as a component vector, convert A to components, add A and B, convert the result back to angle/length, and replace A. (This requires the addition of a fast asin/acos.)
I am not especially good at math and wonder if I am missing a more sensible approach?
I am primarily looking for a general approach, but specific answers/comments about useful micro-optimizations in C is also interesting.
If you need to do a lot of additive operations, it would probably be worth considering storing everything in Cartesian coordinates, rather than polar.
Polar is well-suited to rotation operations (and scaling, I guess), but sticking with Cartesian (where a rotation is four multiplies, see below) is probably going to be cheaper than using cos/sin/acos/asin every time you want to do a vector addition. Although, of course, it depends on the distribution of operations in your case.
FYI, a rotation in Cartesian coordinates is as follows (see http://en.wikipedia.org/wiki/Rotation_matrix):
x' = x·cos(a) - y·sin(a)
y' = x·sin(a) + y·cos(a)
If a is known ahead of time, then cos(a) and sin(a) can be precomputed.
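As a sketch of the Cartesian approach in C (the FAST_SIN/FAST_COS macros are placeholders of mine for the fast approximation mentioned in the question):
#include <math.h>
#include <stdint.h>

/* Swap these for the fast approximations you already have. */
#define FAST_SIN(x) sinf(x)
#define FAST_COS(x) cosf(x)

typedef struct { float x, y; } vec2; /* Cartesian representation */

/* Convert the question's angle/length format: 0-65535 maps to 0-2π. */
static vec2 from_polar(uint16_t angle, int16_t length)
{
    float a = angle * (6.2831853f / 65536.0f);
    return (vec2){ length * FAST_COS(a), length * FAST_SIN(a) };
}

/* Vector addition is then two adds, with no trigonometry at all. */
static vec2 add(vec2 u, vec2 v)
{
    return (vec2){ u.x + v.x, u.y + v.y };
}

/* Rotation by a known angle: four multiplies, with c = cos(a) and
 * s = sin(a) precomputed (e.g. stored alongside B in the lookup table). */
static vec2 rotate(vec2 v, float c, float s)
{
    return (vec2){ v.x * c - v.y * s, v.x * s + v.y * c };
}
The idea is to convert A to Cartesian once, do all additions there, and defer the conversion back to angle/length (atan2 and a square root, or fast equivalents) until the polar form is actually needed.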