I've been looking for a custom polymorphic solution to improve binary compatibility. The problem is that pointer members are varying size on different platforms, so even "static" width members get pushed around producing binary incompatible layouts.
The way I understand it, most compilers implement v-tables in a similar way:
 ____________________
|   |         |      |
| 0 | vtable* | data | -> ->
|___|_________|______|
E.g. the v-table pointer is placed as the first element of the object, and in cases of multiple inheritance the inherited subobjects are laid out sequentially, with the appropriate padding to keep each one aligned.
So, my idea is the following, put all v-tables (and all varying "platform width" members) "behind":
      ____________________
     |         |   |      |
  <- | vtable* | 0 | data | ->
     |_________|___|______|
This way the layout to the right of the 0 (the alignment boundary for the first data member) is comprised only of types with explicit and portable size and alignment, so it can stay uniform, while at the same time you can still portably navigate through the v-tables and other pointer members using indices with the platform-width stride. Also, since all members to the left will have the same size and alignment, this could reduce the need for extra padding in the layout.
Naturally, this means that the "this" pointer will no longer point at the beginning of the object, but at some offset that will vary with every class. That means "new" and "delete" must make adjustments in order for the whole scheme to work. Would that have a measurable negative impact, considering that one way or another, offset calculation takes place when accessing members anyway?
My question is whether someone with more experience can point out potential caveats of using this approach.
Edit:
I did a quick test to determine whether the extra offset calculation would be detrimental to the performance of virtual calls (yeah yeah, I know it is C++ code inside a C question, but I don't have a nanosecond-resolution timer for C, plus the whole point is to compare against the existing polymorphism implementation):
#include <iostream>
#include <QVector>
#include <QElapsedTimer>
using namespace std;

class A;
typedef void (*foo)(A*);
void bar(A*) {}

class A {
public:
    A() : a(&bar) { }
    foo a;
    virtual void b() {}
};

int main() {
    QVector<A*> c;
    int r = 60000000;
    QElapsedTimer t;
    t.start(); // the timer must be started before elapsed() is meaningful
    for (int i = 0; i < r; ++i) c.append(new A);
    cout << "allocated " << c.size() << " objects in " << (quint64)t.elapsed() << endl;
    for (int count = 0; count < 5; ++count) {
        t.restart();
        for (int i = 0; i < r; ++i) {
            A * ap = c[i]; // note that c[i]->a(c[i]) would
            ap->a(ap);     // actually result in a performance hit
        }
        cout << t.elapsed() << endl;
        t.restart();
        for (int i = 0; i < r; ++i) {
            c[i]->b();
        }
        cout << t.elapsed() << endl;
    }
}
After testing with 60 million objects (70 million failed to allocate on the 32-bit compiler I am currently using), it doesn't look like there is any measurable difference between calling a regular virtual function and calling through a pointer that is not the first element in the object (and therefore needs an additional offset calculation), even though in the case of the function pointer the object's address is used twice, i.e. once to find the offset of a and once more as the argument passed into a(). In release mode the times for the two calls are identical (+/- 1 nsec for 60 mil calls), and in debug mode the function pointer is consistently about 1% faster (maybe a function pointer requires fewer resources than a virtual function).
The overhead from adjusting the pointer when allocating and deleting also seems to be practically negligible and totally within the margin of error. Which is kind of expected, considering it should add no more than a single increment of a value that is already on a register with an immediate value, something that should take a single cycle on the platforms I intend to target.
FYI, the address of the vtable is placed at the first DWORD/QWORD of the object, not the table itself. The vtable is shared between objects of the same class/struct.
Having different vtable sizes between platforms is a non-issue BTW. Incompatible platforms can't execute native code of other platforms and for binary translation to work, the emulator needs to know the original architecture.
The main drawbacks to your solution are performance and complexity over the current implementations.
I'm trying to implement AES/DES/... encryption/decryption in software without using any input-dependent operations (specifically, only using constant-time not, and, or, xor operations and input-independent array indexing/loops).
Is there any way to implement an input-independent logical shift (someconst << key[3] & 5, etc.)?
Array indexing with input-dependent variables, hardware shifts with an input-dependent n, and input-dependent conditional jumps must all be avoided; I don't care about code size/speed.
Depending on your requirements and which operations you can assume to be constant time, this code needs some additional modifications.
However, it might point you in the right direction (as the SELECT primitive is quite powerful for side-channel free code):
#define MAX_SHIFT 31 // maximum shift distance (shifting a 32-bit value by 32 is undefined)

// this may not be constant time.
// However, you can find different (more ugly) ways to achieve the same thing.
// 1 -> 0
// 0 -> 0xff...
#define MASK(cond) ((cond) - 1)

// again, make sure everything here is constant time according to your threat model
// (0, x, y) -> y
// (i, x, y) -> x (i != 0)
#define SELECT(cond, A, B) ((MASK(!(cond)) & (A)) | (MASK(!!(cond)) & (B)))

int shift(int value, int shift) {
    int result = value;
    for (int i = 0; i <= MAX_SHIFT; i++) {
        result = SELECT(i ^ shift, result, value);
        // this may not be constant time. If it is not, implement it yourself ;)
        // (the cast avoids signed-overflow UB on the repeated left shift)
        value = (int)((unsigned)value << 1);
    }
    return result;
}
Note, however, that you have to make sure the compiler does not optimize this.
Also, CPUs may employ operand-dependent performance optimizations, which can lead to timing differences.
In addition to this, transient execution attacks like Spectre may also be a possible threat.
In conclusion: It is almost impossible to write side-channel free code.
I am currently programming an 8051 µC in C (compiler: Wickehaeuser µC/51), and I am wondering about the best way to populate a structure. In my current case I have a time/date structure which should be populated with the current time/date from an RTC via SFRs.
So I am thinking of the best method to do this:
Get the data via return value, by creating the variable inside the function (get_foo_create)
Get the data via call by reference (get_foo_by_reference)
Get it via call by reference plus returning it (even as I write this I think it is stupid, but I am also considering it) (get_foo_pointer_return)
The following code is just an example (note: there is currently a failure in the last print, which does not print out the expected value at the moment).
Which is the best method?
typedef struct {
    unsigned char foo;
    unsigned char bar;
    unsigned char baz;
} data_foo;

data_foo get_foo_create(void) {
    data_foo foo;
    foo.bar = 2;
    return foo;
}

void get_foo_by_reference(data_foo *foo) {
    // Read values e.g. from SFR
    foo->bar = 42; // Just simulate SFR
}

data_foo *get_foo_pointer_return(data_foo *foo) {
    // Read values e.g. from SFR
    (*foo).bar = 11; // Just simulate SFR
    return foo;
}

/**
 * Main program
 */
void main(void) {
    data_foo struct_foo;
    data_foo *ptr_foo;
    seri_init(); // Serial Com init
    clear_screen();
    struct_foo = get_foo_create();
    printf("%d\n", struct_foo.bar);
    get_foo_by_reference(&struct_foo);
    printf("%d\n", struct_foo.bar);
    // Bug: this passes a data_foo** where a data_foo* is expected, which
    // is why the last print shows garbage; it should be
    // ptr_foo = get_foo_pointer_return(&struct_foo);
    ptr_foo = get_foo_pointer_return(&ptr_foo);
    //Temp problem also here, got 39 instead 11, tried also
    //printf("%d\n",(void*)(*ptr_foo).bar);
    printf("%d\n",(*ptr_foo).bar);
    SYSTEM_HALT; //Programm end
}
On the 8051, you should avoid using pointers to the extent possible. Instead, it's generally best--if you can afford it--to have some global structures which will be operated upon by various functions. Having functions for "load thing from address" and "store thing to address", along with various functions that manipulate thing, can be much more efficient than trying to have functions that can operate on objects of that type "in place".
For your particular situation, I'd suggest having a global structure called "time", as well as a global union called "ldiv_acc" which combines a uint32_t, two uint16_t, and four uint8_t. I'd also suggest having an "ldivmod" function which divides the 32-bit value in ldiv_acc by an 8-bit argument, leaving the quotient in ldiv_acc and returning the remainder, as well as an "lmul" function which multiplies the 32-bit value in ldiv_acc by an 8-bit value. It's been a long time since I've programmed the 8051, so I'm not sure what help compilers need to generate good code, but 32x32 divisions and multiplies are going to be expensive compared with using a combination of 8x8 multiplies and divides.
On the 8051, code like:
uint32_t time;
uint32_t sec,min,hr;
sec = time % 60;
time /= 60;
min = time % 60;
time /= 60;
hr = time % 24;
time /= 24;
is likely to be big and slow. Using something like:
ldiv_acc.l = time;
sec = ldivmod(60);
min = ldivmod(60);
hr = ldivmod(24);
is apt to be much more compact and, if you're clever, faster. If speed is really important, you could use functions to perform divmod6, divmod10, divmod24, and divmod60, taking advantage of the fact that divmod60(h*256+l) is equal to h*4 + divmod60(h*16+l). The second addition might yield a value greater than 255, but if it does, applying the same technique again will get the operand below 256. Dividing an unsigned char by another unsigned char is faster than divisions involving unsigned int.
This question already has answers here:
Why don't people use xor swaps? [closed]
(4 answers)
Closed 7 years ago.
I have come across a "technique" for swapping 2 variables (ints, chars or pointers) without a third temp variable, like this:
int a = Something;
int b = SomethingElse;
a ^= b;
b ^= a;
a ^= b;
The normal way would be:
int a = Something;
int b = SomethingElse;
int temp = a;
a = b;
b = temp;
All this is fine, but the folks who share this "technique" usually state it as being without extra space.
(A) Is it true that there is no extra space? I think a "memory to memory copy" would require fewer instructions (machine code) compared to a "memory to memory XOR operation".
int temp = a <==> move the contents of memory location a to memory location temp **(1 instruction)**, possibly via a register **(2 instructions)**
a ^= b <==> move the contents of memory location a to register1, move the contents of memory location b to register2, xor register1 and register2 (store the result in register1), move register1 to memory location a **(about 4 instructions)**
It seems that the "technique" will result in longer code and longer runtime.
(B) Is the "technique" faster (or better) in some way or in some cases?
It seems like the "technique" is slower, uses more memory, and is not really suitable for floating point.
EDIT:
It seems that there may be some potential duplicates:
Why don't people use xor swaps?
But this question is obviously different:
(A) That question was closed as "not constructive", where it "will likely solicit debate, arguments, polling, or extended discussion", whereas this question is looking for factual references, e.g. "Is something true?" & "Is this better?"
(B) That question is about why people do not use the "technique", while this question is about the analysis of the "technique", without looking into why people use it or do not use it.
There's no definitive answer: it depends too much on:
the architecture of the machine (number of processors, memory cache size, etc); and
the quality of the compiler's optimisation.
If all the operations are performed within registers, there is unlikely to be a performance penalty of XOR compared with copy.
If you use a third variable you could help the compiler by declaring:
register int temp;
The use of the XOR technique (or addition and subtraction, as in a-=b; b+=a; a=b-a;) dates back to when memory was a crucial resource and saving a stack entry could be very important. These days the only value of this is code obfuscation.
I have absolutely no idea what the effect of XOR would be on floating-point values, but I suspect they might be converted to integers first: you will need to try it, but there is no guarantee that all compilers will give the same result.
At lower levels (e.g. assembly), "variables" no longer have a permanent location. Sometimes a variable will be in memory, but sometimes the same variable will be in one register, and sometimes in a different register.
When generating code, the compiler has to keep track of where each variable happens to be at each point. For an operation like a = b, if b is in register1 then the compiler just decides that a is now also in register1. Nothing needs to be moved or copied.
Now consider this:
// r0 contains variable a
// r1 contains variable b
temp = a
// r0 contains variable a and variable temp
// r1 contains variable b
a = b
// r0 contains variable temp
// r1 contains variable b and variable a
b = temp
// r0 contains variable temp and variable b
// r1 contains variable a
If you think about this you'll realise that no data needs to be moved or copied, and no code needs to be generated. Modern CPUs are able to do "nothing" extremely quickly.
Note: If a variable is a local variable (on the stack) and isn't in a register, then the compiler should be able to do the same "renaming" to avoid moving anything. If the variable is a global variable then things get more complicated - most likely causing a load from memory and a store to memory for each global variable (e.g. 2 loads and 2 stores if both variables are global).
For XOR (and addition/subtraction) the compiler might be able to optimise it so that it becomes nothing; but I wouldn't count on it. The temporary variable moves are much more likely to be optimised well.
Of course not everything is small enough to fit in a register. For example, maybe you're swapping structures and they're 1234 bytes each. In this case the programmer might use memcpy() and the compiler has to do the copies; or the programmer might do one field at a time with XOR (or addition/subtraction) and the compiler has to do that. For these the memcpy() is much more likely to be better optimised. However, maybe the programmer is smarter, and doesn't swap the 1234 bytes of data and only swaps pointers to the data. Pointers fit in registers (even when the data they point to doesn't), so maybe no code is generated for that.
If you are manipulating ints or doubles (or more generally, any type that has addition and subtraction operators), you can do it like this:
int a = 5;
int b = 7;
a += b; // a == 12, b == 7
b = a - b; // a == 12, b == 5
a -= b; // a == 7, b == 5
I have been wondering for a while which of the two following methods are faster or better.
MY CURRENT METHOD
I'm developing a chess game and the pieces are stored as numbers (really bytes to preserve memory) into a one-dimensional array. There is a position for the cursor corresponding to the index in the array. To access the piece at the current position in the array is easy (piece = pieces[cursorPosition]).
The problem is that getting the x and y values for checking whether a move is valid requires division and modulo operations (x = cursorPosition % 8; y = cursorPosition / 8).
Likewise when using x and y to check if moves are valid (you have to do it this way for reasons that would fill the entire page), you have to do something like - purely as an example - if pieces[y * 8 + x] != 0: movePiece = False. The obvious problem is having to do y * 8 + x a bunch of times to access the array.
Ultimately, this means that getting a piece is trivial but then getting the x and y requires another bit of memory and a very small amount of time to compute it each round.
A MORE TRADITIONAL METHOD
Using a two-dimensional array, one can implement the above process a little more easily, except that piece lookup is now a little harder and more memory is used (i.e. piece = pieces[cursorPosition[0]][cursorPosition[1]] or piece = pieces[x][y]).
I don't think this is faster and it definitely doesn't look less memory intensive.
GOAL
My end goal is to have the fastest possible code that uses the least amount of memory. This will be developed for the Unix terminal (and potentially the Windows CMD, if I can figure out how to represent the pieces without color using ANSI escape sequences), and I will either be using a secure (encrypted, with protocol and structure) TCP connection to connect people p2p to play chess, or something else. I don't know how much memory people will have, how fast their computers will be, or how strong their internet connections will be.
I also just want to learn to do this the best way possible and see if it can be done.
I suppose my question is one of the following:
Which of the above methods is better assuming that there are slightly more computations involving move validation (which means that the y * 8 + x has to be used a lot)?
or
Is there perhaps a method that combines the benefits of 1d and 2d arrays without as many drawbacks as I described?
First, you should profile your code to make sure that this is really a bottleneck worth spending time on.
Second, if you're representing your position as an unsigned byte, decomposing it into X and Y coordinates will be very fast. If we use the following C code:
int getX(unsigned char pos) {
    return pos % 8;
}
We get the following assembly with gcc 4.8 -O2:
getX(unsigned char):
    movl %edi, %eax
    andl $7, %eax
    ret
If we get the Y coordinate with:
int getY(unsigned char pos) {
    return pos / 8;
}
We get the following assembly with gcc 4.8 -O2:
getY(unsigned char):
    shrb $3, %dil
    movzbl %dil, %eax
    ret
There is no short answer to this question; it all depends on how much time you spend optimizing.
On some architectures, two-dimensional arrays might work better than one-dimensional. On other architectures, bitmapped integers might be the best.
Do not worry about division and multiplication.
You're dividing, taking the modulo of, and multiplying by 8.
This number is a power of two, so any compiler can use bitwise operations to achieve the result.
(x * 8) is the same as (x << 3)
(x % 8) is the same as (x & (8 - 1))
(x / 8) is the same as (x >> 3)
Those operations are normally performed in a single clock cycle; on many modern superscalar architectures (including ARM), several such operations can be issued per cycle.
Do not worry about using bitwise operators instead of *, % and /. If you're using a compiler that's less than a decade old, it'll perform this optimization for you and use bitwise operations where profitable.
What you should focus on instead, is how easy it will be for you to find out whether or not a move is legal, for instance. This will help your computer-player to "think quickly".
If you're using an 8*8 array, then it's easy to see where a rook ("castle") can move, by checking that only x or only y changes. For the queen, either x or y must stay the same, or both must change by the same number of steps.
If you use a one-dimensional array, you also have advantages.
But performance-wise, it might be a really good idea to use a 16x16 array or a 1x256 array.
Fill the entire array with 0x80 values (e.g. "illegal position"). Then fill the legal fields with 0x00.
If using a 1x256 array, you can check bits 3 and 7 of the index. If either of those is set, then the position is outside the board.
Testing can be done this way:
if(position & 0x88)
{
/* move is illegal */
}
else
{
/* move is legal */
}
... or ...
if(0 == (position & 0x88))
{
/* move is legal */
}
'position' (the index) should be an unsigned byte (uint8_t in C). This way, you'll never have to worry about pointing outside the buffer.
Some people optimize their chess-engines by using 64-bit bitmapped integers.
While this is good for quickly comparing the positions, it has other disadvantages; for instance checking if the knight's move is legal.
It's not easy to say which is better, though.
Personally, I think the one-dimensional array in general might be the best way to do it.
I recommend getting familiar (very familiar) with AND, OR, XOR, bit-shifting and rotating.
See Bit Twiddling Hacks for more information.
I have a question regarding whether it is good to cache arithmetic computations, given the stack usage that implies.
Let's say I have a recursive function like this one :
void foo (unsigned char x, unsigned char z) {
if (!x || !z)
return;
// Do something
for (unsigned char i = 0; i < 100; ++i) {
foo(x - 1, z);
foo(x, z - 1);
}
}
The main thing to see here are the x - 1 and z - 1 evaluated each time in the loop.
To increase performance, I would do something like this :
const unsigned char minus_x = x - 1;
const unsigned char minus_z = z - 1;
for (unsigned char i = 0; i < 100; ++i) {
foo(minus_x, z);
foo(x, minus_z);
}
But doing this means that on each call, minus_x and minus_z are saved on the stack. The recursive function might be called thousands of times, which means thousands of bytes used on the stack. Also, the actual maths involved isn't as simple as a -1.
Is this a good idea?
Edit: It is actually useless, since this is a pretty standard compiler optimization: loop-invariant code motion (see HansPassant's comment).
Would it be a better idea to use a static array containing the computations like :
static const char minuses[256] = {/* 0 for x = 0; x - 1 for x = 1 to 255 */};
and then do :
foo(minuses[x], z);
foo(x, minuses[z]);
This approach greatly limits the actual maths needed, but on each call it has to fetch the cell from the array instead of reading a value from a register.
I am trying to benchmark as much as I can to find the best solution, but if there is a best practice or something I am missing here, please let me know.
FWIW, I tried this with gcc, for two functions foo_1() (no extra variables) and foo_2() (extra variables).
With -O3, gcc unrolled the for loop (!); the two functions were exactly the same size, but not quite the same code. I regret I don't have time to work out how and why they differed.
With -O2, gcc generated exactly the same code for foo_1 and foo_2. As one might expect, it allocated a register to x, z, x-1, z-1 and i, and pushed/popped those to preserve the parent's values -- using 6 x 8 (64-bit machine) bytes of stack for each call (including the return address).
You report 24 bytes of stack used... is that a 32-bit machine ?
With -O0, the picture was different: foo_1 did the x-1 and z-1 each time round the loop, and in both cases the variables were held in memory. foo_1 was slightly shorter, and I suspect the subtraction makes no difference on a modern processor! In this case, foo_1 and foo_2 used the same amount of stack. This is because all the variables in foo are unsigned char, and the extra minus_x and minus_z pack together with the i, using space which would otherwise be padding. If you change minus_x and minus_z to unsigned long long, you get a difference. Curiously, foo_1 used 6 x 8 bytes of stack as well. There were 16 unused bytes in the stack frame, so even taking into account aligning the RSP and the RBP to 16-byte boundaries, it appears to be using more than it needs to... I have no idea why.
I had a quick look at a static array of x - 1. For -O0 it made no difference to the stack use (for the same reason as before). For -O2, it took one look at foo(x, minuses[z]); and hoisted the minuses[z] out of the loop ! Which one ought to have expected... and the stack use stayed the same (at 6 x 8).
More generally, as noted elsewhere, any effective amount of optimisation is going to hoist calculations out of loops where it can. The other thing which is going on is heavy use of registers to hold variables -- both real variables (those you have named) and "pseudo" variables (to hold the pre-calculated result of hoisted stuff). Those registers need to be saved across calls of subroutines -- either by the caller or the callee. The x86 push/pop operate on the entire register, so an unsigned char held in a register is going to need a full 8 or 4 (64-bit or 32-bit mode) bytes of stack. But, hey, that's what you pay for the optimisation !
It's not entirely clear to me whether it's the run-time or the stack use which you are most concerned about. Either way, the message is to leave it up to the compiler, worry if and only if the thing is too slow, and then only worry about the bits which profiling shows are a problem!