In C, we often use char for small number representations. However, the processor always reads (fetches) 32-bit, int-sized values from registers. So every time we need a char, i.e. 8 bits, in our program, the processor has to fetch 32 bits from a register and 'parse' 8 bits out of it.
Hence, does it make sense to use int more often in place of char if memory is not a limitation?
Will it 'help' the processor?
There's the compiler part and the cpu part.
If you tell the compiler you're using a char instead of an int, during static analysis it will know the bounds of the variable are 0-255 instead of 0-(2^32-1). This allows it to optimize your program better.
On the CPU side, your assumption isn't always correct. Take x86 as an example: it has the registers eax and al for 32-bit and 8-bit access. If you only want to work with chars, using al is sufficient; there is no performance loss.
I did some simple benchmarks in response to the comments below:
al:
format PE GUI 4.0
xor ecx, ecx
dec ecx
loop_start:
inc al
add al, al
dec al
dec al
loopd short loop_start
ret
eax:
format PE GUI 4.0
xor ecx, ecx
dec ecx
loop_start:
inc eax
add eax, eax
dec eax
dec eax
loopd short loop_start
ret
times:
$ time ./test_al.exe
./test_al.exe 0.01s user 0.00s system 0% cpu 7.102 total
$ time ./test_eax.exe
./test_eax.exe 0.01s user 0.01s system 0% cpu 7.120 total
So in this case, al is slightly faster, but sometimes eax came out faster. The difference is really negligible. CPUs aren't that simple, though: there may be code-alignment issues, caches, and other things going on, so it's best to benchmark your own code to see whether there's any improvement. IMO, if your code is not super tight, it's best to trust the compiler to optimize things.
I'd stick to int if I were you, as that is probably the most native integral type for your platform. Internally, you can expect shorter types to be converted to int, which can actually degrade performance.
You should never use char and expect it to be consistent across platforms. Although the C standard defines sizeof(char) to be 1, char itself could be signed or unsigned. The choice is down to the compiler.
If you believe that you can squeeze some performance gain in using an 8 bit type then be explicit and use signed char or unsigned char.
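For example (purely illustrative declarations of my own, not from the text above):

#include <stdint.h>

/* Be explicit about signedness and width when they matter: */
unsigned char counter;   /* always unsigned, at least 8 bits */
signed char   delta;     /* always signed */
uint8_t       raw_byte;  /* exact 8-bit width, from <stdint.h> */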
From the ARM System Developer's Guide:
"most ARM data processing operations are 32-bit only. For this reason, you should use
a 32-bit datatype, int or long, for local variables wherever possible. Avoid using char and
short as local variable types, even if you are manipulating an 8- or 16-bit value"
Example code from the book illustrates the point; note the wrap-around handling for char as opposed to unsigned int.
int checksum_v1(int *data)
{
    char i;
    int sum = 0;
    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}
ARM7 assembly when using i as a char
checksum_v1
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v1_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
AND r1,r1,#0xff ; i = (char)r1
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v1_loop ; if (i<64) loop
MOV pc,r14 ; return sum
ARM7 assembly when i is an unsigned int.
checksum_v2
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v2_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1++
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v2_loop ; if (i<64) goto loop
MOV pc,r14 ; return sum
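For reference, a sketch of the C source corresponding to this second listing, assuming it differs from checksum_v1 only in the type of the loop counter (the masking AND disappears because an unsigned int counter needs no 8-bit wrap-around handling):

int checksum_v2(int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}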
If your program is simple enough, the optimizer can do the right thing without you having to worry about it. In this case, plain int would be the simplest (and future-proof) solution.
However, if you really want to combine a specific bit width with speed, you can use the fastest minimum-width integer types from section 7.18.1.3 of the C99 standard (requires a C99-compliant compiler).
For example:
int_fast8_t x;
uint_fast8_t y;
are the signed and unsigned types that are guaranteed to be able to store at least 8 bits of data while using the usually-faster underlying type. Of course, it all depends on what you do with the data afterwards.
For example, on all systems I have tested (see: standard type sizes in C++), the fast types were 8 bits long.
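As a minimal illustration (my own example, not from the standard), a loop counter declared with one of the fast types lets the compiler pick whatever width is cheapest on the target while still guaranteeing at least 8 bits:

#include <stdint.h>

/* Sketch: the compiler may implement uint_fast8_t as a wider type
 * (e.g. plain unsigned int) if that is faster on the target. */
int sum_bytes(const unsigned char *data)
{
    int sum = 0;
    for (uint_fast8_t i = 0; i < 64; i++)   /* at least 8 bits, as fast as available */
        sum += data[i];
    return sum;
}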
Related
This was a question previously posed by a prof of mine, and I'm assuming the 8-bit register is either CL or CH. I got it working by simply moving 01H into the CH register, but I was wondering whether there is any other way of doing this, since I am technically using the whole 16-bit CX register when running the code.
My code, for reference:
MOV CH,01H
L1:INC AX ;to keep count
LOOP L1
You're right, your code uses 16-bit CX. Much worse, it depends on CL being zero before this snippet executes! An 8-bit loop counter that starts at zero will wrap back to zero 256 decrements (or increments) later.
mov al, 0 ; uint8_t i = 0. xor ax,ax is the same code size but zeros AH
loop_top: ; do {
dec al
jnz loop_top ; }while(--i != 0)
Nothing in the question said there needed to be any work inside the loop; this is just an empty delay loop.
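In C terms, the same wrap-around trick looks roughly like this (a sketch of my own; a real compiler would likely optimize an empty loop away, this is just to show the counter logic):

#include <stdint.h>

void delay_256(void)
{
    uint8_t i = 0;
    do {
        /* empty delay-loop body: 0 wraps to 255, so this runs 256 times */
    } while (--i != 0);
}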
Efficiency notes: dec ax is smaller than dec al, and loop rel8 is even more compact than dec/jnz. So if you were optimizing for real 8086 or 8088, you'd want to keep the loop body smaller because it runs more times than the code ahead of the loop. Of course if you actually wanted to just delay, this would delay longer since code-fetch would take more memory accesses. Overall code size is the same either way, for mov ax, 256 (3 bytes) vs. xor ax,ax (2 bytes) or mov al, 0 (2 bytes).
This works the same with any 8-bit register; AL isn't special for any of these instructions, so you'd often want to keep it free for things that can benefit from its special encodings, like cmp al, imm8 in 2 bytes instead of the usual 3.
(mov al, 0 vs. xor al,al - false dependency either way on many modern CPUs. mov ah,0 might avoid a false dependency on Skylake; at least mov from another register does but maybe not immediate. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent. Anyway, xor-zeroing is generally not useful on byte registers.)
I'm looking for an efficient-to-unpack (in terms of small number of basic ALU ops in the generated code) way of encoding 3 base-6 digits (i.e. 3 numbers in the range [0,5]) in 8 bits. Only one is needed at a time, so approaches that need to decode all three in order to access one are probably not good unless the cost of decoding all three is very low.
The obvious method is of course:
x = b%6; // 8 insns
y = b/6%6; // 13 insns
z = b/36; // 5 insns
The instruction counts are measured on x86_64 with gcc>=4.8 which knows how to avoid divs.
Another method (using a different encoding) is:
b *= 6;
x = b >> 8;
b &= 255;
b *= 6;
y = b >> 8;
b &= 255;
b *= 6;
z = b >> 8;
This encoding has more than one representation for many tuples (it uses the whole 8bit range rather than just [0,215]) and appears more efficient if you want all 3 outputs, but wasteful if you only want one.
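For completeness, a sketch of an encoder for that second scheme (my own derivation, not part of the code above): treat b/256 as a base-6 fraction and pick, say, the smallest byte whose first three base-6 fraction digits are the desired values.

#include <stdint.h>

/* Hypothetical encoder matching the multiply-by-6 decoder above:
 * d0 is the digit extracted first (x), d2 the one extracted last (z).
 * Every b in [ceil(n*256/216), (n+1)*256/216) decodes to the same tuple;
 * this picks the smallest one. */
static uint8_t encode_base6(unsigned d0, unsigned d1, unsigned d2)
{
    unsigned n = 36 * d0 + 6 * d1 + d2;        /* packed value, 0..215 */
    return (uint8_t)((n * 256 + 215) / 216);   /* ceil(n * 256 / 216) */
}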
Are there better approaches?
Target language is C but I've tagged this assembly as well since answering requires some consideration of the instructions that would be generated.
As discussed in comments, a LUT would be excellent if it stays hot in cache. uint8_t LUT[3][256] would need the selector scaled by 256, which takes an extra instruction if it's not a compile-time constant. Scaling by 216 to pack the LUT better is only 1 or 2 instructions more expensive. struct3 LUT[216] is nice, where the struct has a 3-byte array member. On x86, this compiles extremely well in position-dependent code where the LUT base can be a 32-bit absolute as part of the addressing mode (if the table is static):
#include <stdint.h>

struct { uint8_t vals[3]; } LUT[216];

unsigned decode_LUT(uint8_t b, unsigned selector) {
    return LUT[b].vals[selector];
}
gcc7 -O3 on Godbolt for x86-64 and AArch64
movzx edi, dil
mov esi, esi # zero-extension to 64-bit: goes away when inlining.
lea rax, LUT[rdi+rdi*2] # multiply by 3 and add the base
movzx eax, BYTE PTR [rax+rsi] # then index by selector
ret
Silly gcc used a 3-component LEA (3 cycle latency and runs on fewer ports) instead of using LUT as a disp32 for the actual load (no extra latency for an indexed addressing mode, I think).
This layout has the added advantage of locality if you ever need to decode multiple components of the same byte.
In PIC / PIE code, this costs 2 extra instructions, unfortunately:
movzx edi, dil
lea rax, LUT[rip] # RIP-relative LEA instead of absolute as part of another addressing mode
mov esi, esi
lea rdx, [rdi+rdi*2]
add rax, rdx
movzx eax, BYTE PTR [rax+rsi]
ret
But that's still cheap, and all the ALU instructions are single-cycle latency.
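Filling the table is a one-time cost. A sketch of the initialization, assuming the packed value uses the straightforward b = x + 6*y + 36*z layout from the question:

/* One-time initialization of the LUT used by decode_LUT above. */
static void init_LUT(void)
{
    for (unsigned b = 0; b < 216; b++) {
        LUT[b].vals[0] = (uint8_t)(b % 6);
        LUT[b].vals[1] = (uint8_t)(b / 6 % 6);
        LUT[b].vals[2] = (uint8_t)(b / 36);
    }
}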
Your 2nd ALU unpacking strategy is promising. I thought at first we could use a single 64-bit multiply to get b*6, b*6*6, and b*6*6*6 in different positions of the same 64-bit integer (b * ((6ULL*6*6<<32) + (36<<16) + 6)).
But the upper byte of each multiply result depends on masking back to 8 bits after each multiply by 6. (If you can think of a way to avoid that requirement, one multiply and shift would be very cheap, especially on 64-bit ISAs where the entire 64-bit multiply result is in one register.)
Still, x86 and ARM can multiply by 6 and mask in 3 cycles of latency, the same or better latency than a multiply, or less on Intel CPUs with zero-latency movzx r32, r8, if the compiler avoids using parts of the same register for movzx.
add eax, eax ; *2
lea eax, [rax + rax*2] ; *3
movzx ecx, al ; 0 cycle latency on Intel
.. repeat for next steps
ARM / AArch64 is similarly good, with add r0, r0, r0 lsl #1 for multiply by 3.
As a branchless way to select one of the three, you could consider storing (from ah / ch / ... to get the shift for free) to an array, then loading with the selector as the index. This costs store/reload latency (~5 cycles), but is cheap for throughput and avoids branch misses. (Possibly a 16-bit store and then a byte reload would be good, scaling the selector in the load address and adding 1 to get the high byte, saving an extract instruction before each store on ARM).
This is in fact what gcc emits if you write it this way:
unsigned decode_ALU(uint8_t b, unsigned selector) {
    uint8_t decoded[3];
    uint32_t tmp = b * 6;
    decoded[0] = tmp >> 8;
    tmp = 6 * (uint8_t)tmp;
    decoded[1] = tmp >> 8;
    tmp = 6 * (uint8_t)tmp;
    decoded[2] = tmp >> 8;
    return decoded[selector];
}
movzx edi, dil
mov esi, esi
lea eax, [rdi+rdi*2]
add eax, eax
mov BYTE PTR -3[rsp], ah # store high half of mul-by-6
movzx eax, al # costs 1 cycle: gcc doesn't know about zero-latency movzx?
lea eax, [rax+rax*2]
add eax, eax
mov BYTE PTR -2[rsp], ah
movzx eax, al
lea eax, [rax+rax*2]
shr eax, 7
mov BYTE PTR -1[rsp], al
movzx eax, BYTE PTR -3[rsp+rsi]
ret
The first store's data is ready 4 cycles after the input to the first movzx, or 5 if you include the extra 1c of latency for reading ah when it's not renamed separately on Intel HSW/SKL. The next 2 stores are 3 cycles apart.
So the total latency is ~10 cycles from b input to result output, if selector=0. Otherwise 13 or 16 cycles.
Measuring a number of different approaches in-place in the function that needs to do this, the practical answer is really boring: it doesn't matter. They're all running at about 50ns per call, and other work is dominating. So for my purposes, the approach that pollutes the cache and branch predictors the least is probably the best. That seems to be:
(b * (int[]){2048,342,57}[i] >> 11) % 6;
where b is the byte containing the packed values and i is the index of the value wanted. The magic constants 342 and 57 are just the multiplicative constants GCC generates for division by 6 and 36, respectively, scaled to a common shift of 11. The final %6 is spurious in the /36 case (i==2) but branching to avoid it does not seem worthwhile.
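Wrapped in a function for readability (a sketch; unpack3 and the static table are just my naming, the constants are the ones from above):

#include <stdint.h>

static unsigned unpack3(uint8_t b, unsigned i)
{
    /* A static table avoids rebuilding the compound literal on every call. */
    static const unsigned mul[3] = { 2048, 342, 57 };
    return ((b * mul[i]) >> 11) % 6;
}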
On the other hand, if doing this same work in a context where there wasn't an interface constraint to have the surrounding function call overhead per lookup, I think an approach like Peter's would be preferable.
How can I coax GCC into emitting the REPE CMPSB instruction from plain C, without the "asm" and "_emit" keywords, calls to an included library, or compiler intrinsics?
I tried some C code like the one listed below, but unsuccessfully:
unsigned int repe_cmpsb(unsigned char *esi, unsigned char *edi, unsigned int ecx) {
    for (; ((*esi == *edi) && (ecx != 0)); esi++, edi++, ecx--);
    return ecx;
}
See how GCC compiles it at this link:
https://godbolt.org/g/obJbpq
P.S.
I realize that there are no guarantees that the compiler will compile C code in any particular way, but I'd like to coax it anyway, for fun and just to see how smart it is.
rep cmps isn't fast; it's >= 2 cycles per count throughput on Haswell, for example, plus startup overhead. (http://agner.org/optimize). You can get a regular byte-at-a-time loop to go at 1 compare per clock (modern CPUs can run 2 loads per clock) even when you have to check for a match and for a 0 terminator, if you write it carefully.
InstLatx64 numbers agree: Haswell can manage 1 cycle per byte for rep cmpsb, but that's total bandwidth (i.e. 2 cycles to compare 1 byte from each string).
Only rep movs and rep stos have "fast strings" support in current x86 CPUs. (i.e. microcoded implementations that internally use wider loads/stores when alignment and lack of overlap allow.)
The "smart" thing for modern CPUs is to use SSE2 pcmpeqb / pmovmskb. (But gcc and clang don't know how to vectorize loops with an iteration count that isn't known before loop entry; i.e. they can't vectorize search loops. ICC can, though.)
However, gcc will for some reason inline repz cmpsb for strcmp against short fixed strings. Presumably it doesn't know any smarter patterns for inlining strcmp, and the startup overhead may still be better than the overhead of a function call to a dynamic library function. Or maybe not, I haven't tested. Anyway, it's not horrible for code size in a block of code that compares something against a bunch of fixed strings.
#include <string.h>

int string_equal(const char *s) {
    return 0 == strcmp(s, "test1");
}
gcc7.3 -O3 output from Godbolt
.LC0:
.string "test1"
string_equal:
mov rsi, rdi
mov ecx, 6
mov edi, OFFSET FLAT:.LC0
repz cmpsb
setne al
movzx eax, al
ret
If you don't booleanize the result somehow, gcc generates a -1 / 0 / +1 result with seta / setb / sub / movzx. (Causing a partial-register stall on Intel before IvyBridge, and a false dependency on other CPUs, because it uses 32-bit sub on the setcc results, /facepalm. Fortunately most code only needs a 2-way result from strcmp, not 3-way).
gcc only does this with fixed-length string constants, otherwise it wouldn't know how to set rcx.
The results are totally different for memcmp: gcc does a pretty good job, in this case using a DWORD and a WORD cmp, with no rep string instructions.
int cmp_mem(const char *s) {
    return 0 == memcmp(s, "test1", 6);
}
cmp DWORD PTR [rdi], 1953719668 # 0x74736574
je .L8
.L5:
mov eax, 1
xor eax, 1 # missed optimization here after the memcmp pattern; should just xor eax,eax
ret
.L8:
xor eax, eax
cmp WORD PTR [rdi+4], 49 # check last 2 bytes
jne .L5
xor eax, 1
ret
Controlling this behaviour
The manual says that -mstringop-strategy=libcall should force a library call, but it doesn't work: there is no change in the asm output.
Neither does -mno-inline-stringops-dynamically -mno-inline-all-stringops.
It seems this part of the GCC docs is obsolete. I haven't investigated further with larger string literals, fixed-size but non-constant strings, or similar cases.
I've disassembled this C code (using IDA) and ran across this bit of code. I believe the second line is an array access, as is the fifth line, but I'm not sure why it uses a sign extend or a zero extend.
I need to convert the code to C, and I'm not sure why the sign/zero extend is used, or what C code would cause that.
mov ecx, [ebp+var_58]
mov dl, byte ptr [ebp+ecx*2+var_28]
mov [ebp+var_59], dl
mov eax, [ebp+var_58]
movsx ecx, [ebp+eax*2+var_20]
movzx edx, [ebp+var_59]
or edx, ecx
mov [ebp+var_59], dl
Unsigned integer types will be zero-extended, while signed types will be sign-extended.
I kinda want to downvote this as too trivial. It's not like there's anything going on that the instruction reference manual doesn't cover. I guess it's different from asking for an explanation of a really simple C program because the trick here is understanding why one might string this sequence of instructions together, rather than just what each one does individually. Being familiar with the idioms used by non-optimizing compilers (store and reload from RAM after every statement) helps.
I'm guessing this is a snippet from inside a function that makes a stack frame, so the negative offsets from ebp (var_58 and so on) are where local variables are spilled when they're not live in registers.
mov ecx, [ebp+var_58] ; load var58 into ecx
mov dl, byte ptr [ebp+ecx*2+var_28] ; load a byte from var28[2*var58]
mov [ebp+var_59], dl ; store it to var59
mov eax, [ebp+var_58] ; load var58 again for some reason? can var59 alias var58?
; otherwise we still have the value in ecx, right?
; Or is this non-optimizing compiler output that's really annoying to read?
movsx ecx, [ebp+eax*2+var_20] ; load var20[var58*2]
movzx edx, [ebp+var_59] ; load var59 again
or edx, ecx ; edx = var59|var20[var58*2]
mov [ebp+var_59], dl ; spill var59 back to memory
I guess the default operand size for movsx/movzx is byte-to-dword. word-to-dword also exists, and I'm surprised your disassembler didn't disambiguate with a byte ptr on the memory operand. I'm inferring that it's a byte load because the preceding store to that address was byte-wide.
movsx is used when loading signed data that's smaller than 32b. C's integer-promotion rules dictate that operations on integer types smaller than int are automatically promoted to int (or unsigned int if int can't represent all values. e.g. if unsigned short and unsigned int are the same size).
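A tiny example of those promotion rules in action (my own illustration), which is exactly why the disassembly pairs a movsx load with a movzx load before the or:

#include <stdio.h>

int main(void)
{
    /* Both operands are promoted to int before the |, so the signed char is
     * sign-extended (like movsx) and the unsigned char is zero-extended
     * (like movzx). */
    signed char   s = -1;     /* bit pattern 0xFF */
    unsigned char u = 0x80;
    int r = s | u;            /* (-1) | 128 == -1, i.e. 0xFFFFFFFF */
    printf("%d\n", r);        /* prints -1 */
    return 0;
}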
8-bit and 32-bit operand sizes are available without operand-size prefix bytes. Since only Intel P6/SnB-family CPUs track partial registers separately, sign- or zero-extending to the full register width on loads can make for faster code (avoiding false dependencies on the previous contents of the register on AMD and Silvermont). So sign- or zero-extending (as appropriate for the data type) on loads is often the best way to handle narrow memory locations.
Looking at the output of non-optimizing compilers is not usually worth bothering with.
If the code had been generated by a proper optimizing compiler, it would probably be more like
mov ecx, [ebp+var_58] ; var58 is live in ecx
mov al, byte ptr [ebp+ecx*2+var_28] ; var59 = var28[2*var58]
or al, [ebp+ecx*2+var_20] ; var59 |= var20[var58*2]
mov [ebp+var_59], al ; spill var59 to memory
Much easier to read, IMO, without the noise of constantly storing/reloading. You can see when a value is used multiple times without having to notice that a load was from an address that was just stored to.
If a false dependency on the upper 24 bits of eax was causing a problem, we could use movzx or movsx loads into two registers and do an or r32, r32 like the original, but then still store only the low 8 bits. (Using a 32-bit or with a memory operand would do a 4-byte load, not a 1-byte load, which could cross a cache line or even a page and segfault.)
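Putting it together, one possible C shape that a non-optimizing compiler could turn into the original sequence (every name and element type here is a guess inferred from the addressing modes, not recovered from the binary):

/* Guessed reconstruction: both arrays are indexed at 2*i; the var_20 element
 * must be a signed type to explain the movsx, and var_59 an unsigned char to
 * explain the movzx on reload. The var_20 element could also be a short,
 * since the disassembly doesn't show the operand size. */
void snippet(unsigned char var28[], signed char var20[], int var58)
{
    unsigned char var59;
    var59 = var28[2 * var58];          /* byte load, spilled to var_59 */
    var59 = var59 | var20[2 * var58];  /* both promoted to int, OR'd, low byte kept */
}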
I've been debugging REP STOS DWORD PTR ES:[EDI] for a while now.
From what I can tell, it always uses:
ECX as the counter.
EAX as the value that gets stored at the address in EDI, repeated ECX times.
Watching a memory dump of what EDI points to, it seems to overwrite that data with the value in EAX, using ECX only as a counter while advancing EDI by 4 bytes each iteration.
It stops when the counter hits 0.
So I came up with this kind of code
while (regs.d.ecx != 0)
{
    *(unsigned int *)(regs.d.edi) = regs.d.eax;
    regs.d.edi += 4;
    regs.d.ecx--;
}
Seems to work, but I'm concerned since I arrived at this by luck and guesswork. Is it solid? Will it always use ECX as the counter and EAX as the data, and does it always copy 4 bytes, never fewer?
You are almost correct. The only difference is that the direction flag (DF) controls whether 4 is added to or subtracted from EDI (and the address is actually an offset from the ES segment base, but you probably don't care about that):
for (; regs.d.ecx != 0; regs.d.ecx--)
{
    *(unsigned int *)(regs.d.edi) = regs.d.eax;
    regs.d.edi += regs.eflags.df ? -4 : 4;
}
Note that the for (; regs.d.ecx != 0; regs.d.ecx--) { } is the action of the REP prefix, and the body of the loop is the action of STOS DWORD....
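To make that split explicit, here is a sketch factoring the emulation the same way (the Regs type and its field names are assumed to match your existing regs struct):

/* One STOS DWORD step... */
static void stosd_step(Regs *regs)
{
    *(unsigned int *)(regs->d.edi) = regs->d.eax;
    regs->d.edi += regs->eflags.df ? -4 : 4;
}

/* ...and the REP prefix as the loop around it. */
static void rep_stosd(Regs *regs)
{
    for (; regs->d.ecx != 0; regs->d.ecx--)
        stosd_step(regs);
}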
Since you are asking a lot of these questions, I think you will find the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A and 2B to be useful. These contain descriptions of each instruction and prefix, including pseudo-code descriptions.