I'm working with a small UART device and frequently need to switch the baud rate at which it operates.
Essentially the whole set-up boils down to
#define FOSC 2000000
#define BAUD 9600
uint8_t rate = (uint8_t) ((FOSC / (16.0 * BAUD)) - 1 + 0.5);
(Where +0.5 is used to round the result.)
I'm currently compiling with gcc 4.8.1, -O1.
Does the compiler optimize away the whole cast or am I left with a cast followed by a constant? Would this differ with different -O# values (besides -O0)? What about -Os (which I might have to compile with eventually)?
If it matters, I'm developing for the Atmel AT90USB647 (or the datasheet [pdf]).
It is extremely likely that any sane compiler will convert that entire expression (including the cast) into a constant when compiling with optimizations enabled.
However, to be sure, you'll need to look at the assembly output of your compiler.
But what about GCC 4.8.1 in particular?
Code
#include <stdint.h>
#include <stdio.h>
#define FOSC 2000000
#define BAUD 9600
int main() {
uint8_t rate = (uint8_t) ((FOSC / (16.0 * BAUD)) - 1 + 0.5);
printf("%u", rate);
}
Portion of the generated assembly with gcc -O1 red.c
main:
.LFB11:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $12, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
We can see clearly that gcc has precomputed the value of 12 for rate.
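If you would rather not depend on the optimizer at all, one option is to compute the divisor with integer arithmetic and check it in a context that requires a constant expression. The following is a minimal sketch, assuming a C11 compiler (for static_assert); the names UBRR_VALUE and baud_rate_setting are hypothetical and not from the question.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

#define FOSC 2000000UL
#define BAUD 9600UL

/* Round-to-nearest of FOSC / (16 * BAUD), minus 1, using integers only.
   Adding half the divisor (8 * BAUD) before dividing does the rounding. */
#define UBRR_VALUE ((FOSC + 8UL * BAUD) / (16UL * BAUD) - 1UL)

/* Both checks are evaluated at compile time; a failure stops the build. */
static_assert(UBRR_VALUE == 12, "unexpected baud divisor");
static_assert(UBRR_VALUE <= UINT8_MAX, "baud divisor does not fit in 8 bits");

/* Hypothetical helper so the constant can be used from ordinary code. */
uint8_t baud_rate_setting(void)
{
    return (uint8_t)UBRR_VALUE;
}
```

Because the expression contains only integer constants, it is a constant expression by the language rules, so no floating-point code can end up in the binary regardless of the -O level.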
Atmel AVR ships the newlib C library, which is a simple ANSI C library, math library, and collection of board support packages. You can refer to the ANSI C specification to find out, specifically the section on conversions.
I would make sure that rate is being written to a volatile variable or pointer. Maybe the rate value is calculated OK, but when it's written to the peripheral destination it is lacking the volatile tag and the optimizer does not perform the write operation.
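As a rough illustration of the volatile point, here is a minimal sketch. The function name write_baud_register and the reg parameter are hypothetical; on real hardware the register name and address come from the device header.

```c
#include <stdint.h>

/* `reg` stands in for a memory-mapped peripheral register (assumption:
   the real one is declared by the vendor's device header). */
static inline void write_baud_register(volatile uint8_t *reg, uint8_t rate)
{
    /* Because the access goes through a volatile lvalue, the compiler
       must emit this store and may not remove or merge it. */
    *reg = rate;
}
```

Without the volatile qualifier, a store whose result is never read back by the program is a candidate for dead-store elimination.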
How would you define a pointer to a XMM register in asm()?
Like accessing array elements in a loop how can you access registers in asm using a counter?
I tried to do it in the following code:
float *f=(float*)_aligned_malloc(64,16);
for(int i=0;i<4;i++)
asm volatile
(
"movaps (%1),%%xmm%0"
:
:"r"(i),"r"(f+4*i)
:"%xmm%0"
);
But the compiler gives me this error:
unknown register name '%xmm%0' in 'asm'
This sounds like a horrible idea compared to using assembler macros or actually manual unrolling. Your code would totally break if gcc decided not to fully unroll the loop, because it can only work with compile-time constant indexing.
Also, there's no way to tell the compiler which register you're putting the result in, so this is basically useless. I'm only answering as a silly exercise in using GNU C inline-asm syntax, not because this answer is possibly useful in any project.
That said, you can do it using an "i" constraint and a c operand modifier to format the immediate as a bare number, like 1 instead of $1.
void *_aligned_malloc(int, int);
void foo()
{
float *f=(float*)_aligned_malloc(64,16);
for(int i=0;i<4;i++) {
asm volatile (
"movaps %[input],%%xmm%c[regnum]"
:
// only compiles with optimization enabled.
:[regnum] "i"(i), [input] "m"(f[4*i])
:"%xmm0", "%xmm1", "%xmm2", "%xmm3"
);
}
}
gcc and clang, with -O3, are able to fully unroll and make i for each iteration a compile-time constant that can match an "i" constraint. This compiles on Godbolt.
# gcc7.3 -O3
foo():
subq $8, %rsp
movl $16, %esi
movl $64, %edi
call _aligned_malloc(int, int) # from a dummy prototype so it compiles
movaps (%rax),%xmm0
movaps 16(%rax),%xmm1 # compiler can use addressing modes because I switched to an "m" constraint
movaps 32(%rax),%xmm2
movaps 48(%rax),%xmm3
vzeroupper # XMM clobbers also include YMM, and I guess gcc assumes you might have dirtied the upper lanes.
addq $8, %rsp
ret
Note that I've only told the compiler about reading the first float of every group of 4.
ICC -O3 says catastrophic error: Cannot match asm operand constraint even with -O3. With optimization disabled, gcc and clang have the same problem, of course. For example, gcc -O0 will say:
<source>: In function 'void foo()':
<source>:11:10: warning: asm operand 0 probably doesn't match constraints
);
^
<source>:11:10: error: impossible constraint in 'asm'
Compiler returned: 1
Because without optimization, i isn't a compile-time constant and can't match an "i" (immediate) constraint.
Obviously you can't use an "r" constraint; that would fill in the asm template with something like %xmm%eax if the compiler picked eax.
Anyway, this is useless because you can't use the destination register afterwards. All you can do is tell the compiler that all of the possible destination registers are clobbered. It's not safe to write to a clobbered register in one asm statement and then assume the value is still there in a later asm statement.
x86, like all other architectures, can't index the architectural registers using a runtime value. Register numbers must be hard-coded into the instruction stream.
(Some microcontrollers, like AVR, have memory-mapped registers, so you can index them by indexing the memory that aliases the register file. But this is rare, and x86 doesn't do it. It would interfere with out-of-order execution in a similar way to self-modifying code. And BTW, SMC (or branching to one of 16 different versions of an instruction) is the only option for runtime indexing of the register file.)
You can't -- there is no way to index into the register file.
If you want to use multiple registers in sequence, you will need to unroll the loop and name each of the registers explicitly.
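One way to "name each register explicitly" without inline asm is to use one intrinsic variable per vector and let the compiler pick the XMM registers. This is only a sketch: the function name sum_four_vectors is hypothetical, it uses unaligned loads (the question used aligned movaps), and the scalar branch exists only so the sketch also builds on targets without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Load four groups of 4 floats from `in` (16 floats total), add them
   lane by lane, and store the 4 sums to `out`. */
void sum_four_vectors(const float *in, float *out)
{
#if defined(__SSE__)
    /* Manually unrolled: one named variable per vector. The compiler
       assigns each one to a register of its choosing. */
    __m128 v0 = _mm_loadu_ps(in + 0);
    __m128 v1 = _mm_loadu_ps(in + 4);
    __m128 v2 = _mm_loadu_ps(in + 8);
    __m128 v3 = _mm_loadu_ps(in + 12);
    __m128 sum = _mm_add_ps(_mm_add_ps(v0, v1), _mm_add_ps(v2, v3));
    _mm_storeu_ps(out, sum);
#else
    /* Scalar fallback with the same pairing of additions. */
    for (size_t lane = 0; lane < 4; lane++)
        out[lane] = (in[lane] + in[4 + lane]) + (in[8 + lane] + in[12 + lane]);
#endif
}
```

Unlike the clobber-based asm, the loaded values stay visible to the compiler, so later code can actually use them.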
I am trying to write the rotate left operation in C using inline assembly, like so:
byte rotate_left(byte a) {
__asm__("rol %0, $1": "=a" (a) : "a" (a));
return a;
}
(Where byte is typedefed as unsigned char).
This raises the error
/tmp/ccKYcEHR.s:363: Error: operand size mismatch for `rol'.
What is the problem here?
AT&T syntax uses the opposite order from Intel syntax. The rotate count has to be first, not last: rol $1, %0.
Also, you don't need and shouldn't use inline asm for this: https://gcc.gnu.org/wiki/DontUseInlineAsm
As described in Best practices for circular shift (rotate) operations in C++, GNU C has intrinsics for narrow rotates, because the rotate-idiom recognition code fails to optimize away an AND of the rotate count. x86 shifts/rotates mask the count with count & 31 even for 8-bit and 16-bit, but rotates still wrap around. It does matter for shifts though.
Anyway, gcc has a builtin function for narrow rotates to avoid any overhead. There's a __rolb wrapper for it in x86intrin.h, but MSVC uses its own __rotr8 and so on from its intrin.h. Anyway, clang doesn't support either the __builtin or the x86intrin.h wrappers for rotates, but gcc and ICC do.
#include <stdint.h>
uint8_t rotate_left_byte_by1(uint8_t a) {
return __builtin_ia32_rolqi(a, 1); // qi = quarter-integer
}
I used uint8_t from stdint.h like a normal person instead of defining a byte type.
This doesn't compile at all with clang, but it compiles as you'd hope with gcc7.2:
rotate_left_byte_by1:
movl %edi, %eax
rolb %al
ret
This gives you a function that compiles as efficiently as your inline asm ever could, but which can optimize away completely for compile-time constants, and the compiler knows how it works / what it does and can optimize accordingly.
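For compilers that reject the builtin, a plain-C rotate is a possible fallback. The sketch below uses the usual shift-and-or idiom with the count masked to the type width; the name rotl8 is hypothetical, and whether a given compiler turns it into a single rol is something to check in the generated assembly.

```c
#include <stdint.h>

/* Rotate an 8-bit value left by n bits (n is reduced modulo 8). */
static inline uint8_t rotl8(uint8_t x, unsigned n)
{
    n &= 7;  /* keep the count in 0..7 */
    /* x is promoted to int, so the left shift cannot overflow;
       (8 - n) & 7 makes the n == 0 case a shift by 0 instead of by 8. */
    return (uint8_t)((x << n) | (x >> ((8 - n) & 7)));
}
```

For example, rotl8(0x81, 1) moves the top bit around to bit 0 and yields 0x03.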
I have some code that rotates my data. I know GAS syntax has a single assembly instruction that can rotate an entire byte. However, when I try to follow any of the advice on Best practices for circular shift (rotate) operations in C++, my C code compiles into at least 5 instructions, which use up three registers-- even when compiling with -O3. Maybe those are best practices in C++, and not in C?
In either case, how can I force C to use the ROR x86 instruction to rotate my data?
The precise line of code which is not getting compiled to the rotate instruction is:
value = (((y & mask) << 1 ) | (y >> (size-1))) //rotate y right 1
^ (((z & mask) << n ) | (z >> (size-n))) // rotate z left by n
// size can be 64 or 32, depending on whether we are rotating a long or an int, and
// mask would be 0xff or 0xffffffff, accordingly
I do not mind using __asm__ __volatile__ to do this rotate, if that's what I must do. But I don't know how to do so correctly.
Your macro compiles to a single ror instruction for me... specifically, I compiled this test file:
#define ROR(x,y) ((unsigned)(x) >> (y) | (unsigned)(x) << 32 - (y))
unsigned ror(unsigned x, unsigned y)
{
return ROR(x, y);
}
as C, using gcc 6, with -O2 -S, and this is the assembly I got:
.file "test.c"
.text
.p2align 4,,15
.globl ror
.type ror, @function
ror:
.LFB0:
.cfi_startproc
movl %edi, %eax
movl %esi, %ecx
rorl %cl, %eax
ret
.cfi_endproc
.LFE0:
.size ror, .-ror
.ident "GCC: (Debian 6.4.0-1) 6.4.0 20170704"
.section .note.GNU-stack,"",@progbits
Please try to do the same, and report the assembly you get. If your test program is substantially different from mine, please tell us how it differs. If you are using a different compiler or a different version of GCC please tell us exactly which one.
Incidentally, I get the same assembly output when I compile the code in the accepted answer for "Best practices for circular shift (rotate) operations in C++", as C.
How old is your compiler? As I noted in the linked question, the UB-safe variable-count rotate idiom (with extra & masking of the count) confuses old compilers, like gcc before 4.9. Since you're not masking the shift count, it should be recognized with even older gcc.
Your big expression is maybe confusing the compiler. Write an inline function for rotate, and call it, like
value = rotr32(y & mask, 1) ^ rotr32(z & mask, n);
Much more readable, and may help stop the compiler from trying to do things in the wrong order and breaking the idiom before recognizing it as a rotate.
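A minimal sketch of such an inline function follows. The name rotr32 matches the call above; the body is the common masked-count idiom, which is an assumption about what the answer had in mind rather than a quote from it.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n is reduced modulo 32). */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;  /* keep the count in 0..31 */
    /* (32 - n) & 31 avoids an undefined shift by 32 when n == 0. */
    return (x >> n) | (x << ((32 - n) & 31));
}
```

A call such as rotr32(0x80000001u, 1) gives 0xC0000000. As noted above, very old compilers may not recognize the masked form, so inspect the assembly if the single instruction matters.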
Maybe those are best practices in C++, and not in C?
My answer on the linked question clearly says that it's the best practice for C as well as C++. They are different languages, but they overlap completely for this, according to my testing.
Here's a version of the Godbolt link using -xc to compile as C, not C++. I had a couple C++isms in the link in the original question, for experimenting with integer types for the rotate count.
Like the original linked from the best-practices answer, it has a version that uses x86 intrinsics if available. clang doesn't seem to provide any in x86intrin.h, but other compilers have _rotl / _rotr for 32-bit rotates, with other sizes available.
Actually, I talked about rotate intrinsics at length in the answer on the best-practices question, not just in the godbolt link. Did you even read the answer there, apart from the code block? (If you did, your question doesn't reflect it.)
Using intrinsics, or the idiom in your own inline function, is much better than using inline asm. Asm defeats constant-propagation, among other things. Also, compilers can use BMI2 rorx dst, src, imm8 to copy-and-rotate with one instruction, if you compile with -march=haswell or -mbmi2. It's a lot harder to write an inline-asm rotate that can use rorx for immediate-count rotates but ror r32, cl for variable-count rotates. You could try with __builtin_constant_p(), but clang evaluates that before inlining, so it's basically useless for meta-programming style choice of which code to use. It works with gcc though. But it's still much better not to use inline asm unless you've exhausted all other avenues (like asking on SO) to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm
Fun fact: the rotate functions in gcc's x86intrin.h are just pure C using the rotate idiom that gcc recognizes. Except for 16-bit rotates, where they use __builtin_ia32_rolhi.
You might need to be a bit more specific with what integral type / width you're rotating, and whether you have a fixed or variable rotation. ror{b,w,l,q} (8, 16, 32, 64-bit) has forms for (1), imm8, or the %cl register. As an example:
static inline uint32_t rotate_right (uint32_t u, size_t r)
{
__asm__ ("rorl %%cl, %0" : "+r" (u) : "c" (r));
return u;
}
I haven't tested this, it's just off the top of my head. And I'm sure multiple constraint syntax could be used to optimize cases where a constant (r) value is used, so %e/rcx is left alone.
If you're using a recent version of gcc or clang (or even icc), the intrinsics header <x86intrin.h> may provide __ror{b|w|d|q} intrinsics. I haven't tried them.
Best Way:
#define rotr32(x, n) (( x>>n ) | (x<<(32-n)))
#define rotr64(x, n) (( x>>n ) | (x<<(64-n)))
More generic:
#define rotr(x, n) (( x>>n ) | (x<<((sizeof(x)<<3)-n)))
And it compiles (in GCC) with exactly the same code as the asm versions below.
For 64 bit:
__asm__ __volatile__("rorq %b1, %0" : "=g" (u64) : "Jc" (cShift), "0" (u64));
or
static inline uint64_t CC_ROR64(uint64_t word, int i)
{
__asm__("rorq %%cl,%0"
:"=r" (word)
:"0" (word),"c" (i));
return word;
}
I'm trying to understand how bitwise operations are handled by C/C++ compilers.
Specifically, I'm talking about C compiled with gcc, but I believe that the question is a bit more general than that.
Anyway, suppose I have macros defined as follows:
#define SOME_CONSTANT 0x111UL
#define SOME_OFFSET 2
#define SOME_MASK 7
#define SOME_VALUE ((SOME_CONSTANT) << (SOME_OFFSET)) & (SOME_MASK)
static inline void foo() { printf("Value: %lu\n", SOME_VALUE); }
All the ingredients of SOME_VALUE are constants, and they are all known at compile time.
So my question is: will gcc evaluate SOME_VALUE at compile time, or will it be done at runtime only?
How do I check whether a gcc supports such optimization?
Your compiler does not know about SOME_VALUE. The C code is first run through the C preprocessor, and only its output is passed to the C compiler. You can see the output of the C preprocessor by running gcc as:
gcc -E code.c
You'll see that the real code fed to C compiler is:
int main(void) {
printf("Value: %lu\n", ((0x111UL) << (2)) & (7));
return 0;
}
So the question becomes "Does C Compiler of GCC optimize ((0x111UL) << (2)) & (7)", and the answer is yes (as indicated by other answerers who proved it by looking at the assembly code generated).
Yes, gcc will optimise this as it is a completely constant expression.
To check this look at the assembly code, for example with this tool https://gcc.godbolt.org/
#include <stdio.h>
#define SOME_CONSTANT 0x111UL
#define SOME_OFFSET 2
#define SOME_MASK 7
#define SOME_VALUE ((SOME_CONSTANT) << (SOME_OFFSET)) & (SOME_MASK)
void foo() { printf("Value: %lu\n", SOME_VALUE); }
I had to modify your code slightly as otherwise gcc optimises away the whole thing and leaves nothing!
.LC0:
.string "Value: %lu\n"
foo():
movl $4, %esi
movl $.LC0, %edi
xorl %eax, %eax
jmp printf
will gcc evaluate SOME_VALUE at compile time
I don't know about yours, mine does
How do I check whether a gcc supports such optimization?
I used -S flag to generate assembly code and checked it
movl $4, %esi
As others have answered, yes, it will. But do consider that it is not required to; if you want the certainty of that, just pre-calculate it, as you have all the elements to do so.
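A middle ground between trusting the optimizer and pre-calculating by hand is to use the macro where the language demands a constant expression. The sketch below assumes C11 (for static_assert) and makes one change to the question's macro: it adds outer parentheses to SOME_VALUE. Without them, SOME_VALUE == 4 would parse as (...) & ((7) == 4), because == binds tighter than &. The function name some_value is hypothetical.

```c
#include <assert.h>   /* static_assert (C11) */

#define SOME_CONSTANT 0x111UL
#define SOME_OFFSET 2
#define SOME_MASK 7
/* Outer parentheses added (not in the question) so the macro is safe
   to use inside larger expressions. */
#define SOME_VALUE (((SOME_CONSTANT) << (SOME_OFFSET)) & (SOME_MASK))

/* Checked at compile time: (0x111 << 2) & 7 == 0x444 & 7 == 4. */
static_assert(SOME_VALUE == 4UL, "SOME_VALUE is not the expected constant");

/* Hypothetical accessor for use from ordinary code. */
unsigned long some_value(void)
{
    return SOME_VALUE;
}
```

If the expression ever stopped being a constant expression, the static_assert would fail to compile rather than silently moving the work to run time.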
I had been struggling for weeks with a poor-performing translator I had written.
On the following simple benchmark
#include<stdio.h>
int main()
{
int x;
char buf[2048];
FILE *test = fopen("test.out", "wb");
setvbuf(test, buf, _IOFBF, sizeof buf);
for(x=0;x<1024*1024; x++)
fprintf(test, "%04d", x);
fclose(test);
return 0;
}
we see the following result
bash-3.1$ gcc -O2 -static test.c -o test
bash-3.1$ time ./test
real 0m0.334s
user 0m0.015s
sys 0m0.016s
As you can see, the moment the "-std=c99" flag is added in, performance comes crashing down:
bash-3.1$ gcc -O2 -static -std=c99 test.c -o test
bash-3.1$ time ./test
real 0m2.477s
user 0m0.015s
sys 0m0.000s
The compiler I'm using is gcc 4.6.2 mingw32.
The file generated is about 12M, so this is a difference of about 21MB/s between the two.
Running diff shows that the generated files are identical.
I assumed this has something to do with file locking in fprintf, of which the program makes heavy use, but I haven't been able to find a way to switch that off in the C99 version.
I tried flockfile on the stream I use at the beginning of the program, and a corresponding funlockfile at the end, but was greeted with compiler errors about implicit declarations, and linker errors claiming undefined references to those functions.
Could there be another explanation for this problem, and more importantly, is there any way to use C99 on windows without paying such an enormous performance price?
Edit:
After looking at the code generated by these options, it looks like in the slow versions, mingw sticks in the following:
_fprintf:
LFB0:
.cfi_startproc
subl $28, %esp
.cfi_def_cfa_offset 32
leal 40(%esp), %eax
movl %eax, 8(%esp)
movl 36(%esp), %eax
movl %eax, 4(%esp)
movl 32(%esp), %eax
movl %eax, (%esp)
call ___mingw_vfprintf
addl $28, %esp
.cfi_def_cfa_offset 4
ret
.cfi_endproc
In the fast version, this simply does not exist; otherwise, both are exactly the same. __mingw_vfprintf seems to be the slowpoke here, but I have no idea what behavior it needs to emulate that makes it so slow.
After some digging in the source code, I have found why the MinGW function is so terribly slow:
At the beginning of a [v,f,s]printf in MinGW, there is some innocent-looking initialization code:
__pformat_t stream = {
dest, /* output goes to here */
flags &= PFORMAT_TO_FILE | PFORMAT_NOLIMIT, /* only these valid initially */
PFORMAT_IGNORE, /* no field width yet */
PFORMAT_IGNORE, /* nor any precision spec */
PFORMAT_RPINIT, /* radix point uninitialised */
(wchar_t)(0), /* leave it unspecified */
0, /* zero output char count */
max, /* establish output limit */
PFORMAT_MINEXP /* exponent chars preferred */
};
However, PFORMAT_MINEXP is not what it appears to be:
#ifdef _WIN32
# define PFORMAT_MINEXP __pformat_exponent_digits()
# ifndef _TWO_DIGIT_EXPONENT
# define _get_output_format() 0
# define _TWO_DIGIT_EXPONENT 1
# endif
static __inline__ __attribute__((__always_inline__))
int __pformat_exponent_digits( void )
{
char *exponent_digits = getenv( "PRINTF_EXPONENT_DIGITS" );
return ((exponent_digits != NULL) && ((unsigned)(*exponent_digits - '0') < 3))
|| (_get_output_format() & _TWO_DIGIT_EXPONENT)
? 2
: 3
;
}
This winds up getting called every time I want to print, and getenv on windows must not be very quick. Replacing that define with a 2 brings the runtime back to where it should be.
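Hard-coding 2 drops the environment lookup entirely. If the lookup should be kept, caching it is another way to avoid calling getenv on every print. The following is only a sketch of that idea, not MinGW's actual fix: the function name exponent_digits_cached is hypothetical, the _get_output_format() term is left out (it is defined as 0 in the excerpt above), and the cache is not thread-safe.

```c
#include <stdlib.h>

/* Look up PRINTF_EXPONENT_DIGITS once and reuse the answer afterwards.
   Trade-off: later changes to the environment variable are ignored. */
int exponent_digits_cached(void)
{
    static int cached = 0;  /* 0 means "not looked up yet" */

    if (cached == 0) {
        const char *s = getenv("PRINTF_EXPONENT_DIGITS");
        /* Same test as the excerpt: first character in '0'..'2' gives 2,
           anything else (or an unset variable) gives 3. */
        cached = (s != NULL && (unsigned)(*s - '0') < 3) ? 2 : 3;
    }
    return cached;
}
```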
So, the answer comes down to this: when using -std=c99 or any ANSI-compliant mode, MinGW switches the CRT runtime with its own. Normally, this wouldn't be an issue, but the MinGW lib had a bug which slowed its formatting functions down far beyond anything imaginable.
Using -std=c99 disables all GNU extensions.
With GNU extensions and optimization, a call like fprintf(test, "B") is probably replaced by fputc('B', test).
Note this answer is obsolete, see https://stackoverflow.com/a/13973562/611560 and https://stackoverflow.com/a/13973933/611560
After some consideration of your assembler, it looks like the slow version is using the *printf() implementation of MinGW, based undoubtedly on the GCC one, while the fast version is using the Microsoft implementation from msvcrt.dll.
Now, the MS one is notable for lacking a lot of features that the GCC one does implement. Some of these are GNU extensions, but others are for C99 conformance. And since you are using -std=c99, you are requesting the conformance.
But why so slow? Well, one factor is simplicity: the MS version is far simpler, so it is expected to run faster, even in the trivial cases. Another factor is that you are running under Windows, so the MS version is expected to be more efficient than one copied from the Unix world.
Does it explain a factor of x10? Probably not...
Another thing you can try:
Replace fprintf() with sprintf(), printing into a memory buffer without touching the file at all. Then you can try doing fwrite() without printfing. That way you can guess if the loss is in the formatting of the data or in the writing to the FILE.
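The split suggested above could look roughly like this. It is a sketch with hypothetical function names (format_numbers, write_raw) and uses snprintf rather than sprintf so the buffer bound is checked; timing each half separately (for example with clock() around each call) is left to the caller.

```c
#include <stdio.h>
#include <stddef.h>

/* Formatting only: print `count` integers as "%04d" into buf.
   Returns the number of bytes produced, or 0 on error or if buf is
   too small to hold everything plus the terminating NUL. */
size_t format_numbers(char *buf, size_t cap, int count)
{
    size_t used = 0;
    for (int x = 0; x < count; x++) {
        int n = snprintf(buf + used, cap - used, "%04d", x);
        if (n < 0 || (size_t)n >= cap - used)
            return 0;  /* encoding error or output truncated */
        used += (size_t)n;
    }
    return used;
}

/* File output only: write an already formatted buffer, no printf involved. */
size_t write_raw(FILE *out, const char *buf, size_t len)
{
    return fwrite(buf, 1, len, out);
}
```

If format_numbers alone shows the slowdown under -std=c99, the cost is in the formatting code rather than in the FILE layer.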
Since MinGW32 3.15, compliant printf functions are available to use instead of those found in Microsoft C runtime (CRT).
The new printf functions are used when compiling in strict ANSI, POSIX and/or C99 modes.
For more information see the mingw32 changelog
You can use __msvcrt_fprintf() to use the fast (non compliant) function.