I have a question regarding whether it is good to save arithmetic computations to limit the stack usage.
Let's say I have a recursive function like this one:
void foo(unsigned char x, unsigned char z) {
    if (!x || !z)
        return;
    // Do something
    for (unsigned char i = 0; i < 100; ++i) {
        foo(x - 1, z);
        foo(x, z - 1);
    }
}
The main thing to see here is that x - 1 and z - 1 are evaluated each time around the loop.
To increase performance, I would do something like this:
const unsigned char minus_x = x - 1;
const unsigned char minus_z = z - 1;
for (unsigned char i = 0; i < 100; ++i) {
    foo(minus_x, z);
    foo(x, minus_z);
}
But doing this means that on each call, minus_x and minus_z are saved on the stack. The recursive function might be called thousands of times, which means thousands of bytes used on the stack. Also, the real maths involved aren't as simple as a -1.
Is this a good idea?
Edit: It is actually useless, since it is a pretty standard compiler optimization: loop-invariant code motion (see HansPassant's comment).
Would it be a better idea to use a static array containing the computations, like:
static const char minuses[256] = { /* 0 for x = 0; x - 1 for x = 1 to 255 */ };
and then do:
foo(minuses[x], z);
foo(x, minuses[z]);
This approach greatly reduces the actual maths needed, but on each call it has to fetch the cell from the array instead of reading a value from a register.
I am trying to benchmark as much as I can to find the best solution, but if there is a best practice or something I am missing here, please let me know.
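For reference, a minimal timing harness for the two variants could look like the sketch below; the volatile counter standing in for "Do something" and the small depths are assumptions made only to keep it runnable:

#include <stdio.h>
#include <time.h>

/* Copies of the two variants from the question; the volatile counter stands
 * in for "Do something" so the compiler cannot delete the whole call tree. */
static volatile unsigned long counter;

static void foo_1(unsigned char x, unsigned char z)
{
    if (!x || !z)
        return;
    counter++;
    for (unsigned char i = 0; i < 100; ++i) {
        foo_1(x - 1, z);
        foo_1(x, z - 1);
    }
}

static void foo_2(unsigned char x, unsigned char z)
{
    if (!x || !z)
        return;
    counter++;
    const unsigned char minus_x = x - 1;
    const unsigned char minus_z = z - 1;
    for (unsigned char i = 0; i < 100; ++i) {
        foo_2(minus_x, z);
        foo_2(x, minus_z);
    }
}

int main(void)
{
    clock_t start;

    start = clock();
    foo_1(2, 2);    /* keep the depths small: the call tree grows very fast */
    printf("foo_1: %.3f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    start = clock();
    foo_2(2, 2);
    printf("foo_2: %.3f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    return 0;
}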
FWIW, I tried this with gcc, for two functions foo_1() (no extra variables) and foo_2() (extra variables).
With -O3 gcc unrolled the for loop (!); the two functions were exactly the same size, but not quite the same code. I regret I don't have time to work out how and why they differed.
With -O2 gcc generated exactly the same code for foo_1 and foo_2. As one might expect, it allocated a register each to x, z, x-1, z-1 and i, and pushed/popped those to preserve the parent's values -- using 6 x 8 (64-bit machine) bytes of stack for each call (including the return address).
You report 24 bytes of stack used... is that a 32-bit machine?
With -O0, the picture was different: foo_1 did the x-1 and z-1 each time round the loop, and in both cases the variables were held in memory. foo_1 was slightly shorter, and I suspect that the subtraction makes no difference on a modern processor! In this case, foo_1 and foo_2 used the same amount of stack. This is because all the variables in foo are unsigned char, and the extra minus_x and minus_z pack together with the i, using space which is otherwise padding. If you change minus_x and minus_z to unsigned long long, you get a difference. Curiously, foo_1 used 6 x 8 bytes of stack as well. There were 16 unused bytes in the stack frame, so even taking into account aligning the RSP and the RBP to 16 byte boundaries, it appears to be using more than it needs to... I have no idea why.
I had a quick look at a static array of x - 1. For -O0 it made no difference to the stack use (for the same reason as before). For -O2, it took one look at foo(x, minuses[z]); and hoisted the minuses[z] out of the loop! Which one ought to have expected... and the stack use stayed the same (at 6 x 8).
More generally, as noted elsewhere, any effective amount of optimisation is going to hoist calculations out of loops where it can. The other thing which is going on is heavy use of registers to hold variables -- both real variables (those you have named) and "pseudo" variables (to hold the pre-calculated result of hoisted stuff). Those registers need to be saved across calls of subroutines -- either by the caller or the callee. The x86 push/pop operate on the entire register, so an unsigned char held in a register is going to need a full 8 or 4 (64-bit or 32-bit mode) bytes of stack. But, hey, that's what you pay for the optimisation!
It's not entirely clear to me whether it's the run-time or the stack use which you are most concerned about. Either way, the message is to leave it up to the compiler, and worry if and only if the thing is too slow -- and then only worry about the bits which profiling shows are a problem!
I'm trying to implement AES/DES/.. encryption/decryption in software without using any input dependent operations (specifically only using constant time not, and, or, xor operations and input independent array indexing/loops).
Is there any way to implement input independent logical shift (someconst << key[3] & 5 etc.)?
Array indexing with an input-dependent variable, hardware shifts with an input-dependent n, and input-dependent conditional jumps must all be avoided, and I don't care about code size/speed.
Depending on your requirements and which operations you can assume to be constant time, this code needs some additional modifications.
However, it might point you in the right direction (as the SELECT primitive is quite powerful for side-channel free code):
#define MAX_SHIFT 32 // maximum amount to be shifted

// this may not be constant time.
// However, you can find different (more ugly) ways to achieve the same thing.
// 1 -> 0
// 0 -> 0xff...
#define MASK(cond) (cond - 1)

// again, make sure everything here is constant time according to your threat model
// (0, x, y) -> y
// (i, x, y) -> x (i != 0)
#define SELECT(cond, A, B) ((MASK(!(cond)) & A) | (MASK(!!(cond)) & B))

int shift(int value, int shift){
    int result = value;
    for(int i = 0; i <= MAX_SHIFT; i++){
        result = SELECT(i ^ shift, result, value);
        // this may not be constant time. If it is not, implement it yourself ;)
        value <<= 1;
    }
    return result;
}
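A quick (and deliberately non-constant-time) sanity check that the SELECT-based shift matches the << operator for small inputs could look like this; note that shift() keeps left-shifting its int argument internally, so very large values would be safer with an unsigned type:

#include <assert.h>
#include <stdio.h>

int shift(int value, int shift);   // the shift() defined above

int main(void)
{
    // compare against the plain shift operator for a few sample inputs
    for (int n = 0; n < 16; n++) {
        assert(shift(1, n) == (1 << n));
        assert(shift(0x1234, n) == (0x1234 << n));
    }
    printf("ok\n");
    return 0;
}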
Note, however, that you have to make sure the compiler does not optimize this.
Also, CPUs may employ operand-dependent performance optimizations that can lead to timing differences.
In addition to this, transient execution attacks like Spectre may also be a possible threat.
In conclusion: It is almost impossible to write side-channel free code.
My project is to scan an address space (which in my case is 0x00000000 - 0xffffffff, or 0 to 2^32 - 1) for a pattern and return in an array the locations in memory where the pattern was found (it could be found multiple times).
Since the address space is 32 bits, i is a double and max is pow(2,32) (also a double).
I want to keep the original value of i intact so that I can use that to report the location of where the pattern was found (since actually finding the pattern requires moving forward several bytes past i), so I want temp, declared as char *, to copy the value of i. Then, later in my program, I will dereference temp.
double i, max = pow(2, 32);
char *temp;
for (i = 0; i < max; i++)
{
    temp = (char *) i;
    //some code involving *temp
}
The issue I'm running into is that a double can't be cast to a char *. An int can be; however, since the address space is 32 bits (not 16), I need a double, which is exactly large enough to represent 2^32.
Is there anything I can do about this?
In C, double and float are not represented the way you think they are; this code demonstrates that:
#include <stdio.h>

typedef union _DI
{
    double d;
    int i;
} DI;

int main()
{
    DI di;
    di.d = 3.00;
    printf("%d\n", di.i);
    return 0;
}
You will not see an output of 3 in this case.
In general, even if you could read another process's memory, your strategy is not going to work on any modern operating system because of virtual memory (the address space that one process "sees" doesn't necessarily (in fact, it usually doesn't) represent the physical memory on the system).
Never use a floating point variable to store an integer. Floating point variables make approximate computations. It would happen to work in this case, because the integers are small enough, but to know that, you need intimate knowledge of how floating point works on a particular machine/compiler and what range of integers you'll be using. Plus it's harder to write the program, and the program would be slower.
C defines an integer type that's large enough to store a pointer: uintptr_t. You can cast a pointer to uintptr_t and back. On a 32-bit machine, uintptr_t will be a 32-bit type, so it's only able to store values up to 2^32 - 1. To express a loop that covers the whole range of the type including the first and last value, you can't use an ordinary for loop with a variable that's incremented, because the ending condition requires a value of the loop index that's out of range. If you naively write
uintptr_t i;
for (i = 0; i <= UINTPTR_MAX; i++) {
    unsigned char *temp = (unsigned char *)i;
    // ...
}
then you get an infinite loop, because after the iteration with i equal to UINTPTR_MAX, running i++ wraps the value of i to 0. The fact that the loop is infinite can also be seen in a simpler logical way: the condition i <= UINTPTR_MAX is always true since all values of the type are less or equal to the maximum.
You can fix this by putting the test near the end of the loop, before incrementing the variable.
i = 0;
do {
    unsigned char *temp = (unsigned char *)i;
    // ...
    if (i == UINTPTR_MAX) break;
    i++;
} while (1);
Note that exploring 4GB in this way will be extremely slow, if you can even do it. You'll get a segmentation fault whenever you try to access an address that isn't mapped. You can handle the segfault with a signal handler, but that's tricky and slow. What you're attempting may or may not be what your teacher expects, but it doesn't make any practical sense.
To explore a process's memory on Linux, read /proc/self/maps to discover its memory mappings. See my answer on Unix.SE for some sample code in Python.
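If a C version is more convenient, a minimal sketch of reading those mappings could look like this (it only prints each line of /proc/self/maps; parsing the start and end addresses out of each line is left to the reader):

#include <stdio.h>

int main(void)
{
    // each line of /proc/self/maps describes one mapping:
    // start-end perms offset dev inode path
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL) {
        perror("fopen /proc/self/maps");
        return 1;
    }

    char line[512];
    while (fgets(line, sizeof line, maps) != NULL)
        fputs(line, stdout);

    fclose(maps);
    return 0;
}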
Note also that if you're looking for a pattern, you need to take the length of the whole pattern into account; a byte-by-byte lookup doesn't do the whole job.
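For illustration, matching a multi-byte pattern inside a readable region might look like the following sketch (the region bounds are assumed to come from the mappings above; all names here are made up for the example):

#include <stddef.h>
#include <string.h>

/* Scan the readable bytes in [start, end) and record every address where
 * the whole pattern matches, not just its first byte. */
static size_t find_pattern(const unsigned char *start, const unsigned char *end,
                           const unsigned char *pattern, size_t pattern_len,
                           const unsigned char **hits, size_t max_hits)
{
    size_t found = 0;

    if (pattern_len == 0 || (size_t)(end - start) < pattern_len)
        return 0;

    for (const unsigned char *p = start; p <= end - pattern_len; p++) {
        if (memcmp(p, pattern, pattern_len) == 0) {
            if (found < max_hits)
                hits[found] = p;
            found++;
        }
    }
    return found;
}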
Ahh, a school assignment. OK then.
uint32_t i;
for ( i = 0; i < 0xFFFFFFFF; i++ )
{
    char *x = (char *)i;
    // Do magic here.
}
// Also, the above code skips 0xFFFFFFFF itself, so magic that one address here.
// But if your pattern is longer than 1 byte, then it's not necessary
// (in fact, use something less than 0xFFFFFFFF in the above loop then)
The cast of a double to a pointer is a constraint violation - hence the error.
A floating type shall not be converted to any pointer type. C11dr §6.5.4 4
To scan the entire 32-bit address space, use a do loop with an integer type capable of the [0 ... 0xFFFFFFFF] range.
uint32_t address = 0;
do {
    char *p = (char *) address;
    foo(p);
} while (address++ < 0xFFFFFFFF);
I am using the IAR C compiler to build an application for an embedded micro (specifically a Renesas uPD78F0537). In this application I am using two nested for loops to initialize some data, as shown in the following MCVE:
#include <stdio.h>

#define NUM_OF_OBJS 254
#define MAX_OBJ_SIZE 4

unsigned char objs[NUM_OF_OBJS][MAX_OBJ_SIZE];
unsigned char srcData[NUM_OF_OBJS][MAX_OBJ_SIZE];

void main(void)
{
    srcData[161][2] = 10;

    int x, y;
    for (x = 0; x < NUM_OF_OBJS; x++)
    {
        for (y = 0; y < MAX_OBJ_SIZE; y++)
        {
            objs[x][y] = srcData[x][y];
        }
    }

    printf("%d\n", (int) objs[161][2]);
}
The output value is 0, not 10.
The compiler is generating the following code for the for loop:
13 int x, y;
14 for (x = 0; x < NUM_OF_OBJS; x++)
\ 0006 14.... MOVW DE,#objs
\ 0009 16.... MOVW HL,#srcData
15 {
16 for (y = 0; y < MAX_OBJ_SIZE; y++)
\ 000C A0F8 MOV X,#248
17 {
18 objs[x][y] = srcData[x][y];
\ ??main_0:
\ 000E 87 MOV A,[HL]
\ 000F 95 MOV [DE],A
19 }
\ 0010 86 INCW HL
\ 0011 84 INCW DE
\ 0012 50 DEC X
\ 0013 BDF9 BNZ ??main_0
20 }
The above does not work: The compiler is apparently precalculating NUM_OF_OBJS x MAX_OBJ_SIZE = 1016 (0x3f8). This value is used as a counter, however it is being truncated to 8 bits (0xf8 == 248) and stored in 8-bit register 'X'. As a result, only the first 248 bytes of data are initialised, instead of the full 1016 bytes.
I can work around this, however my question is: Is this a compiler bug? Or am I overlooking something?
Update
This microcontroller has 7KB of RAM, and sizeof(int) == 2
Copying the data over using pointers (e.g. something along the lines of while (len-- > 0) *dst++ = *src++;) works fine.
I am pretty much convinced that this is a compiler bug based on the following:
Copying the data over using pointers (e.g. something along the lines of while (len-- > 0) *dst++ = *src++;) works fine. Thus this does not look like a problem of RAM size, pointer size, etc.
More relevant perhaps: If I just replace one of the two constants (NUM_OF_OBJS or MAX_OBJ_SIZE) with a static variable (thus preventing the compiler from precalculating the total count) it works fine.
Unfortunately I contacted IAR (providing a link to this SO question) and this is their answer:
Sorry to read that you don't have a license/support-agreement (SUA).
An examination such as needed for this case can take time, and we
(post-sales-support) prioritize to put time and effort to users with
valid SUA.
Along with some generic comments which are not particularly useful.
So I guess I'll just assume this is a bug and work around it in my code (there are many possible workarounds, including the two described above).
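As an illustration, the pointer-based workaround could look something along these lines (the exact variable names and the unsigned int length type are assumptions for this sketch; on this target unsigned int is 16 bits, which comfortably holds 1016):

unsigned char *dst = &objs[0][0];
const unsigned char *src = &srcData[0][0];
unsigned int len = sizeof objs;   /* 254 * 4 = 1016 bytes */

while (len-- > 0)
    *dst++ = *src++;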
Although technically one may assume that those arrays are in .bss and zeroed, compilers now issue warnings when you actually make that assumption and read something from .bss before writing it. Being global, the arrays could be modified by an interrupt, etc., so is it really a safe assumption (for a compiler) that they are zeros? Gcc does not make this assumption: it copies the whole array over. Clang/llvm likewise copies the whole array and does not take a shortcut. A shortcut would have looked like writing the one value to both arrays and printing that value, instead of copying the whole array.
Interestingly, even if the arrays are declared static, or made local, gcc is not able to figure out the shortcut, which is strange for gcc, which isn't the best but usually figures out much more complicated dead-code loops.
What is the size of int for this compiler target? Maybe int is 8 bits in size and the compiler performed the action you asked for. If not, then perhaps yes, this is a compiler bug. Try making the 254 a little bit smaller: does the 0xF8 change by a proportional amount? Does it always appear to be carving off the lower 8 bits, or is 0xF8 some magic number with a relationship to 1016 or 254 or 161, etc.?
Given that this is the full code, the compiler isn't obliged to do anything, since the code is nonsense. The compiler is free to optimize away the whole loop since it does nothing.
A fully standard compliant compiler must initialize both objs and srcData to all-zeroes, since they have static storage duration.
Therefore the nested loop does nothing but shovelling zeroes from one array to another. If the compiler notices that this loop is pointless, it is free to remove the loop entirely.
Of course it doesn't make much sense to reduce the number of iterations in the loop as a way of optimization, so one may wonder what weird decisions the optimizer took to come up with that machine code. Odd and dumb as it may seem, it is fully standard compliant.
You could declare the loop iterators as volatile to enforce side-effects. In that case the loop has to be executed even though it is pointless, since reading/writing to a volatile variable is a side-effect, which the compiler isn't allowed to optimize away.
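A minimal sketch of that suggestion, changing only the iterator declarations from the question's code:

volatile int x, y;   /* volatile: every read/write of x and y must really happen */
for (x = 0; x < NUM_OF_OBJS; x++)
{
    for (y = 0; y < MAX_OBJ_SIZE; y++)
    {
        objs[x][y] = srcData[x][y];
    }
}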
Please keep in mind that embedded systems compilers often have a non-standard, "minimal start-up" option where they skip the initialization of static storage duration variables to achieve faster system boot-up. If such a non-standard option is enabled, the variables will contain garbage values instead of zeroes.
I have come across a "technique" for swapping 2 variables (ints, chars or pointers) without a third temp variable, like this:
int a = Something;
int b = SomethingElse;
a ^= b;
b ^= a;
a ^= b;
The normal way would be:
int a = Something;
int b = SomethingElse;
int temp = a;
a = b;
b = temp;
All this is fine, but the folks who share this "technique" usually present it as using no extra space.
(A) Is it true that there is no extra space? I think a "memory to memory copy" would require fewer instructions (machine code) than a "memory to memory XOR operation".
int temp = a <==> move content of memory location a to memory location temp **(1 instruction)**, possibly via a register **(2 instructions)**
a ^= b <==> move contents of memory location a to register1, move contents of memory location b to register2, xor register1 and register2 (store results in register1), move register1 to memory location a **(about 4 instructions)**
It seems that the "technique" will result in longer code and longer runtime.
(B) Is the "technique" faster (or better) in some way or in some cases ?
It seems like the "technique" is slower, uses more memory, and not really suitable for floating points.
EDIT:
It seems that there may be some potential duplicates:
Why don't people use xor swaps?
But this question is obviously Different:
(A) That question was closed as "Not Constructive", where it "will likely solicit debate, arguments, polling, or extended discussion", whereas this question is looking for factual references, e.g. "Is something true?" & "Is this better?"
(B) That question is about why people do not use the "technique", while this question is about the analysis of the "technique", without looking onto why people use it or do not use it.
There's no definitive answer: it depends too much on:
the architecture of the machine (number of processors, memory cache size, etc); and
the quality of the compiler's optimisation.
If all the operations are performed within registers, there is unlikely to be a performance penalty of XOR compared with copy.
If you use a third variable you could help the compiler by declaring:
register int temp;
The use of the XOR technique (or addition and subtraction, as in a-=b; b+=a; a=b-a;) dates back to when memory was a crucial resource and saving a stack entry could be very important. These days the only value of this is code obfuscation.
I have absolutely no idea what the effect of XOR would be on floating-point values, but I suspect they might be converted to integers first: you will need to try it, but there is no guarantee that all compilers will give the same result.
At a lower level (e.g. assembly), "variables" no longer have a permanent location. Sometimes a variable will be in memory, but sometimes the same variable will be in one register, and sometimes in a different register.
When generating code, the compiler has to keep track of where each variable happens to be at each point. For an operation like a = b, if b is in register1 then the compiler just decides that a is now also in register1. Nothing needs to be moved or copied.
Now consider this:
// r0 contains variable a
// r1 contains variable b
temp = a
// r0 contains variable a and variable temp
// r1 contains variable b
a = b
// r0 contains variable temp
// r1 contains variable b and variable a
b = temp
// r0 contains variable temp and variable b
// r1 contains variable a
If you think about this you'll realise that no data needs to be moved or copied, and no code needs to be generated. Modern CPUs are able to do "nothing" extremely quickly.
Note: If a variable is a local variable (on the stack) and isn't in a register, then the compiler should be able to do the same "renaming" to avoid moving anything. If the variable is a global variable then things get more complicated - most likely causing a load from memory and a store to memory for each global variable (e.g. 2 loads and 2 stores if both variables are global).
For XOR (and addition/subtraction) the compiler might be able to optimise it so that it becomes nothing; but I wouldn't count on it. The temporary variable moves are much more likely to be optimised well.
Of course not everything is small enough to fit in a register. For example, maybe you're swapping structures and they're 1234 bytes each. In this case the programmer might use memcpy() and the compiler has to do the copies; or the programmer might do one field at a time with XOR (or addition/subtraction) and the compiler has to do that. For these the memcpy() is much more likely to be better optimised. However, maybe the programmer is smarter, and doesn't swap the 1234 bytes of data and only swaps pointers to the data. Pointers fit in registers (even when the data they point to doesn't), so maybe no code is generated for that.
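A sketch of that pointer-swapping idea (the struct layout and names here are just assumptions for illustration):

#include <stdio.h>

struct big { unsigned char payload[1234]; };

/* Swap which structure each pointer refers to; the 1234-byte
 * payloads themselves are never copied. */
static void swap_ptrs(struct big **x, struct big **y)
{
    struct big *temp = *x;
    *x = *y;
    *y = temp;
}

int main(void)
{
    struct big a = { { 1 } }, b = { { 2 } };
    struct big *pa = &a, *pb = &b;

    swap_ptrs(&pa, &pb);
    printf("%d %d\n", pa->payload[0], pb->payload[0]);   /* prints "2 1" */
    return 0;
}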
If you are manipulating ints or doubles (or more generally, any type that has addition and subtraction operators), you can do it like this:
int a = 5;
int b = 7;
a += b; // a == 12, b == 7
b = a - b; // a == 12, b == 5
a -= b; // a == 7, b == 5
I have been wondering for a while which of the two following methods are faster or better.
MY CURRENT METHOD
I'm developing a chess game and the pieces are stored as numbers (really bytes to preserve memory) into a one-dimensional array. There is a position for the cursor corresponding to the index in the array. To access the piece at the current position in the array is easy (piece = pieces[cursorPosition]).
The problem is that getting the x and y values for checking whether a move is valid requires the division and modulo operators (x = cursorPosition % 8; y = cursorPosition / 8).
Likewise when using x and y to check if moves are valid (you have to do it this way for reasons that would fill the entire page), you have to do something like - purely as an example - if pieces[y * 8 + x] != 0: movePiece = False. The obvious problem is having to do y * 8 + x a bunch of times to access the array.
Ultimately, this means that getting a piece is trivial but then getting the x and y requires another bit of memory and a very small amount of time to compute it each round.
A MORE TRADITIONAL METHOD
Using a two-dimensional array, one can implement the above process a little easier except for the fact that piece lookup is now a little harder and more memory is used. (I.e. piece = pieces[cursorPosition[0]][cursorPosition[1]] or piece = pieces[x][y]).
I don't think this is faster and it definitely doesn't look less memory intensive.
GOAL
My end goal is to have the fastest possible code that uses the least amount of memory. This will be developed for the Unix terminal (and potentially Windows CMD, if I can figure out how to represent the pieces without color using ANSI escape sequences), and I will either be using a secure (encrypted with protocol and structure) TCP connection to connect people p2p to play chess, or something else. I don't know how much memory people will have, how fast their computers will be, or how strong their internet connections will be.
I also just want to learn to do this the best way possible and see if it can be done.
I suppose my question is one of the following:
Which of the above methods is better assuming that there are slightly more computations involving move validation (which means that the y * 8 + x has to be used a lot)?
or
Is there perhaps a method that includes both of the benefits of 1d and 2d arrays with not as many draw backs as I described?
First, you should profile your code to make sure that this is really a bottleneck worth spending time on.
Second, if you're representing your position as an unsigned byte decomposing it into X and Y coordinates will be very fast. If we use the following C code:
int getX(unsigned char pos) {
    return pos % 8;
}
We get the following assembly with gcc 4.8 -O2:
getX(unsigned char):
    movl %edi, %eax
    andl $7, %eax
    ret
If we get the Y coordinate with:
int getY(unsigned char pos) {
    return pos / 8;
}
We get the following assembly with gcc 4.8 -O2:
getY(unsigned char):
    shrb $3, %dil
    movzbl %dil, %eax
    ret
There is no short answer to this question; it all depends on how much time you spend optimizing.
On some architectures, two-dimensional arrays might work better than one-dimensional. On other architectures, bitmapped integers might be the best.
Do not worry about division and multiplication.
You're dividing by, taking the modulo of, and multiplying by 8.
This number is a power of two, so any computer can use bitwise operations to achieve the result.
(x * 8) is the same as (x << 3)
(x % 8) is the same as (x & (8 - 1))
(x / 8) is the same as (x >> 3)
Those operations are normally performed in a single clock cycle. On many modern architectures, they can be performed in less than a single clock cycle (including ARM architectures).
Do not worry about using bitwise operators instead of *, % and /. If you're using a compiler that's less than a decade old, it'll optimize it for you and use bitwise operations.
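As a quick illustration of those equivalences (they hold for the non-negative values a board index can take):

#include <assert.h>

int main(void)
{
    // verify the three equivalences over every possible board index
    for (int x = 0; x < 256; x++) {
        assert((x * 8) == (x << 3));
        assert((x % 8) == (x & (8 - 1)));
        assert((x / 8) == (x >> 3));
    }
    return 0;
}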
What you should focus on instead is how easy it will be for you to find out whether or not a move is legal, for instance. This will help your computer player to "think quickly".
If you're using an 8*8 array, then it's easy to see where a castle can move by checking whether only x or only y changes. If checking the queen, then either x or y must stay the same, or x and y must change by the same number of steps.
If you use a one-dimensional array, you also have advantages.
But performance-wise, it might be a real good idea to use a 16x16 array or a 1x256 array.
Fill the entire array with 0x80 values (eg. "illegal position"). Then fill the legal fields with 0x00.
If using a 1x256 array, you can check bit 3 and 7 of the index. If any of those are set, then the position is outside the board.
Testing can be done this way:
if(position & 0x88)
{
    /* move is illegal */
}
else
{
    /* move is legal */
}
... or ...
if(0 == (position & 0x88))
{
    /* move is legal */
}
'position' (the index) should be an unsigned byte (uint8_t in C). This way, you'll never have to worry about pointing outside the buffer.
Some people optimize their chess-engines by using 64-bit bitmapped integers.
While this is good for quickly comparing the positions, it has other disadvantages; for instance checking if the knight's move is legal.
It's not easy to say which is better, though.
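For completeness, the 64-bit bitmap idea boils down to one bit per square; a minimal sketch (the square numbering and helper names are assumptions made for this example):

#include <stdint.h>
#include <stdio.h>

/* One bit per square: bit 0 = a1, ..., bit 63 = h8 (one possible convention). */
typedef uint64_t bitboard;

static bitboard set_square(bitboard b, int square) { return b | ((bitboard)1 << square); }
static int has_square(bitboard b, int square) { return (int)((b >> square) & 1); }

int main(void)
{
    bitboard white = 0;
    white = set_square(white, 0);    /* put something on square 0 */
    white = set_square(white, 9);    /* and on square 9 */

    /* an occupancy test is a single AND; comparing whole positions is one 64-bit compare */
    printf("square 9 occupied: %d\n", has_square(white, 9));
    return 0;
}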
Personally, I think the one-dimensional array in general might be the best way to do it.
I recommend getting familiar (very familiar) with AND, OR, XOR, bit-shifting and rotating.
See Bit Twiddling Hacks for more information.