I have read tons of similar questions about this stuff here but still can't solve my problem.
I need to use this piece of assembly code in a C file:
unsigned char *led = (unsigned char *) 0xC0;
for (int i = 0; i < 10; i++)
{
    __asm
    {
        ror [led[i]], 1
    };
}
I just need to rotate all the values in the led array right by one bit. With this code I get a "Factor expected" error. When I remove the [] brackets around the led value, I get an "end of line expected" error instead.
The strange thing is that the code below works perfectly:
__asm
{
    nop
};
As you can see, these samples look very similar, yet only one of them works. What am I doing wrong?
In case it's relevant: I'm trying to build a simple application in CodeWarrior, an IDE for embedded software development.
I'm debugging C code running on a MSP430 microprocessor using GDB.
When I set a breakpoint on the line double average = sum / 10; using break 172, it confirms by responding Breakpoint 1 at 0xc01c: file main.c, line 172, but when I continue with c, the code runs until it hits Breakpoint 1, main () at main.c:184.
I wasn't having issues debugging until recently, so I tried reverting everything to the previous version and I still have this issue. I have also tried:
Turning my laptop off and on.
Unplugging and re-plugging every cable related to the microprocessor and its circuit.
Closing and re-opening all terminal windows.
Re-compiling and re-loading my C code into the microprocessor.
Print statements to help debugging aren't an option because the microprocessor can't hold #include <stdio.h>.
Clearing all breakpoints present before setting this one, but none are found.
The code looks something like:
void main(void)
{
    OtherMethod();
    while (1)
    {
        int sum = 0;
        for (int i = 0; i < 10; i++)
        {
            sum += i;
        }
        double average = sum / 10; // Line 172
    }
}

void OtherMethod(void)
{
    P1DIR |= LED1 + LED2; // Line 184
}
Other information that might be helpful is that I can successfully set a breakpoint on the line sum += i;.
Any ideas are appreciated.
If you compile with optimization, several "strange" things can happen; see your compiler's documentation. Statements may be removed or re-arranged, which leads to surprising behaviour when debugging.
To debug a program line by line, compile without optimization.
Or live with the surprises; they're a source of delight, in any case.
Using Atmel Studio 7, with an STK600 and a 32UC3C MCU.
I'm pulling my hair out over this.
I'm sending strings of variable size over UART once every 5 seconds. Each string consists of one letter as an opcode, followed by two chars that give the length of the following data string (without a terminating zero; there is never a zero at the end of any of these strings). In most cases the string will be 3 chars long, because it carries no data ("p00").
After investigating I found that what was supposed to be "p00" was in fact "0p0" or "00p" (and only on the first try after restarting the micro, "p00"). I looked it up in the memory view of the debugger. Then I started hTerm and confirmed that the transmitted data was in fact "p00". So after a while hTerm showed me "p00p00p00p00p00p00p00..." while the memory of my circular UART buffer read "p000p000p0p000p0p000p0p0..."
edit: Actually "0p0" and "00p" are alternating.
The baud rate is 9600. In the past I was only sending single letters, and everything worked fine.
This is the code of the Receiver Interrupt:
I tried different variations in code that were all doing the same in a different way. But all of them showed the exact same behavior.
lastWebCMDWritePtr is a uint8_t* type and so is lastWebCMDRingstartPtr.
lastWebCMDRingRXLen is a uint8_t type.
__attribute__((__interrupt__))
void UartISR_forWebserver()
{
    *(lastWebCMDWritePtr++) = (uint8_t)((&AVR32_USART0)->rhr & 0x1ff);
    lastWebCMDRingRXLen++;
    if (lastWebCMDWritePtr - lastWebCMDRingstartPtr > lastWebCMDRingBufferSIZE)
    {
        lastWebCMDWritePtr = lastWebCMDRingstartPtr;
    }
    // Variation 2:
    // advanceFifo((uint8_t)((&AVR32_USART0)->rhr & 0x1ff));
    // Variation 3:
    // if (usart_read_char(&AVR32_USART0, getReadPointer()) == USART_RX_ERROR)
    // {
    //     usart_reset_status(&AVR32_USART0);
    // }
}
I welcome any of your ideas and advice.
Regards, Someo
P.S. I put the Atmel Studio tag in case this has something to do with the myriad of debugger bugs in AS.
For a complete picture you would have to show where and how lastWebCMDWritePtr, lastWebCMDRingRXLen, lastWebCMDRingstartPtr and lastWebCMDRingBufferSIZE are used elsewhere (on the consuming side).
Also, I would first try a simpler ISR with no dependencies on other software modules, to rule out a hardware or register-handling problem.
Approach:
#define USART_DEBUG
#define DEBUG_BUF_SIZE 30
__attribute__((__interrupt__))
void UartISR_forWebserver()
{
    uint8_t rec_byte;
#ifdef USART_DEBUG
    static volatile uint8_t usart_debug_buf[DEBUG_BUF_SIZE]; // circular buffer for debugging
    static volatile int usart_debug_buf_index = 0;
#endif
    rec_byte = (uint8_t)((&AVR32_USART0)->rhr & 0x1ff);
#ifdef USART_DEBUG
    usart_debug_buf_index = usart_debug_buf_index % DEBUG_BUF_SIZE;
    usart_debug_buf[usart_debug_buf_index] = rec_byte;
    usart_debug_buf_index++;
    if (!(usart_debug_buf_index < DEBUG_BUF_SIZE)) {
        usart_debug_buf_index = 0; // candidate for a breakpoint to see what happened in the past
    }
#endif
    //uart_recfifo_enqueue(rec_byte);
}
I'm trying to figure out how to structure the main loop code for a numerical simulation in such a way that the compiler generates nicely vectorized instructions in a compact way.
The problem is most easily explained with C pseudocode, but I also have a Fortran version which is affected by the same kind of issue. Consider the following loop, where the lots_of_code_* are complicated expressions which each produce a fair number of machine instructions.
void process(const double *in_arr, double *out_arr, int len)
{
    for (int i = 0; i < len; i++)
    {
        const double a = lots_of_code_a(i, in_arr);
        const double b = lots_of_code_b(i, in_arr);
        ...
        const double z = lots_of_code_z(i, in_arr);
        out_arr[i] = final_expr(a, b, ..., z);
    }
}
When compiled with an AVX target the Intel compiler generates code which goes like
process:
AVX_loop
AVX_code_a
AVX_code_b
...
AVX_code_z
AVX_final_expr
...
SSE_loop
SSE_instructions
...
scalar_loop
scalar_instructions
...
The resulting binary is already quite sizable. My actual calculation loop, though, looks more like the following:
void process(const double *in_arr1, ... , const double *in_arr30,
             double *out_arr1, ... , double *out_arr30,
             int len)
{
    for (int i = 0; i < len; i++)
    {
        const double a1 = lots_of_code_a(i, in_arr1);
        ...
        const double a30 = lots_of_code_a(i, in_arr30);
        const double b1 = lots_of_code_b(i, in_arr1);
        ...
        const double b30 = lots_of_code_b(i, in_arr30);
        ...
        ...
        const double z1 = lots_of_code_z(i, in_arr1);
        ...
        const double z30 = lots_of_code_z(i, in_arr30);
        out_arr1[i] = final_expr1(a1, ..., z1);
        ...
        out_arr30[i] = final_expr30(a30, ..., z30);
    }
}
This results in a very large binary indeed (400KB for the Fortran version, 800KB for C99). If I now define lots_of_code_* as functions, then each function gets turned into non-vectorized code. Whenever the compiler decides to inline a function it does vectorize it, but seems to also duplicate the code each time as well.
In my mind, the ideal code should look like:
AVX_lots_of_code_a:
AVX_code_a
AVX_lots_of_code_b:
AVX_code_b
...
AVX_lots_of_code_z:
AVX_code_z
SSE_lots_of_code_a:
SSE_code_a
...
scalar_lots_of_code_a:
scalar_code_a
...
...
process:
AVX_loop
call AVX_lots_of_code_a
call AVX_lots_of_code_a
...
SSE_loop
call SSE_lots_of_code_a
call SSE_lots_of_code_a
...
scalar_loop
call scalar_lots_of_code_a
call scalar_lots_of_code_a
...
This clearly results in a much smaller code which is still just as well optimized as the fully-inlined version. With luck it might even fit in L1.
Obviously I can write this myself using intrinsics or whatever, but is it possible to get the compiler to automatically vectorize in the way described above through "normal" source code?
I understand that the compiler will probably never generate separate symbols for each vectorized version of the functions, but I thought it could still just inline each function once inside process and use internal jumps to repeat the same code block, rather than duplicating code for each input array.
A formal answer to questions like yours: consider using OpenMP 4.0 SIMD-enabled (I didn't say inlined) functions, or equivalent proprietary mechanisms. These are available in the Intel compiler and in fresh GCC 4.9.
See more details here: https://software.intel.com/en-us/node/522650
Example:
// Invoke this function from a vectorized loop
#pragma omp declare simd
int vfun(int x, int y)
{
    return x*x + y*y;
}
This gives you the capability to vectorize loops with function calls without inlining, and as a result without huge code generation. (I didn't really explore your code snippet in detail; instead I answered the question you asked in textual form.)
The immediate problem that comes to mind is the lack of restrict on the input/output-pointers. The input is const though, so it's probably not too much of a problem, unless you have multiple output-pointers.
Other than that, I recommend -fassociative-math or whatever the ICC equivalent is. Structurally, you seem to iterate over the array, doing multiple independent operations that are only munged together at the very end. Strict FP compliance might kill you on the array operations. Finally, there's probably no way this will get vectorized if you need more intermediate results than vector_registers - input_arrays.
Edit: I think I see your problem now. You call the same function on different data and want each result stored independently, right? The problem is that the same function always writes to the same output register, so subsequent vectorized calls would clobber earlier results. The solutions could be:
A stack of results (either in memory or like the old x87 FPU stack) that gets pushed every time. If in memory, it is slow; if x87, it's not vectorized. Bad idea.
Effectively multiple functions writing into different registers. Code duplication. Bad idea.
Rotating registers, like on the Itanium. You don't have an Itanium? You're not alone.
It's possible that this can't be easily vectorized on current architectures. Sorry.
Edit: you're apparently fine with going to memory:
void function1(double const *restrict inarr1, double const *restrict inarr2,
               double *restrict outarr, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        double intermediateres[NUMFUNCS];
        double *rescursor = intermediateres;
        *rescursor++ = mungefunc1(inarr1[i]);
        *rescursor++ = mungefunc1(inarr2[i]);
        *rescursor++ = mungefunc2(inarr1[i]);
        *rescursor++ = mungefunc2(inarr2[i]);
        ...
        outarr[i] = finalmunge(intermediateres[0],...,intermediateres[NUMFUNCS-1]);
    }
}
This might be vectorizable. I don't think it'll be all that fast, going at memory speed, but you never know till you benchmark.
If you moved the lots_of_code blocks into separate compilation units without the for loop, they would probably not vectorize. Unless the compiler has a motive for vectorization, it will not vectorize the code, because vectorization might lead to longer latencies in the pipelines. To get around that, split the loop into 30 loops and put each one of them in a separate compilation unit, like this:
for (int i = 0; i < len; i++)
{
    lots_of_code_a(i, in_arr1);
}
I'm trying to program an example histogram tool using OpenCL. To start, I was just interested in atomically incrementing each bin. I came up with the following kernel code:
__kernel void Histogram(
    __global const int* input,
    __global int* histogram,
    int numElements) {
    // get index into global data array
    int iGID = get_global_id(0);
    // bound check, equivalent to the limit on a 'for' loop
    if (iGID >= numElements) {
        return;
    }
    if (iGID < 100) {
        // initialize histogram
        histogram[iGID] = 0;
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
    int bin = input[iGID];
    atomic_inc(&histogram[bin]);
}
But the output histogram is zero in every bin. Why is that? Furthermore, really strange things happen if I put a printf(" ") in the last line: suddenly, it works. I am completely lost; does someone have an idea why this happens?
P.S.
I enabled all extensions
I solved the problem myself.
After nothing else fixed it, I tried changing the CLDevice to the CPU. Everything went as it was supposed to (unfortunately very slowly :D). But this gave me the idea that it might not be a code problem but an OpenCL infrastructure problem.
I updated AMD's OpenCL platform and now everything works.
Thank you, in case you thought about my problem.
I am trying to program the Blinky example built with the Keil compiler onto a P89LPC936 microcontroller through a universal programmer (SuperPro), but the microcontroller does not run it. When I write a simple program in assembly and program the same hardware, it works fine. Please, I need help finding out where I am going wrong.
Here is the code:
Code:
/* Blinky.C - LED Flasher for the Keil LPC900 EPM Emulator/Programmer Module */
#include <REG936.H> // register definition
void delay (unsigned long cnt)
{
while (--cnt);
}
void main()
{
unsigned char i;
P1M1 |= 0x20;
P1M2 &= 0xDF;
P2M1 &= 0xE7;
P2M2 |= 0x18;
delay (20000);
for(;;)
{ for (i = 0x01; i; i <<= 1)
{ P2 = i; // simulate running lights
delay (20000);
}
for (i = 0x80; i; i >>= 1)
{ P2 = i;
delay (20000);
}
}
}
Here is the hex file:
:10006B008F0B8E0A8D098C08780874FF12004DECEB
:06007B004D4E4F70F32210
:100003004391205392DF53A4E743A5187F207E4EEC
:100013007D007C0012006B7B01EB6013F5A07F2059
:100023007E4E7D007C0012006BEB25E0FB80EA7BBB
:1000330080EB60E3F5A07F207E4E7D007C00120004
:070043006BEBC313FB80EA25
:01004A002293
:04FFF00023001E00CC
:08FFF800000000000000000001
:030000000200817A
:0C00810078FFE4F6D8FD75810B02000347
:10004B007401FF3395E0FEFDFC080808E62FFFF670
:10005B0018E63EFEF618E63DFDF618E63CFCF622E9
:00000001FF
And here is the assembly code and its hex file, which works correctly.
Code:
; LPC936A1.A51
; Oct 7, 2010 PCB: ?
; Features: ?
; ?
$mod51
RL1 bit P2.3
RL2 bit P2.4
DSEG AT 20H
FLAG1: ds 1
STACK: ds 1
FRL1 bit FLAG1.0 ; Relay 1
CSEG
org 0H
ajmp Reset
org 30H
Reset: mov 0A5H,#0FFH
Start: mov c,FRL1 ;
mov RL1,c
cpl c
mov FRL1,c
mov RL2,c
acall Delay0
ajmp Start
Delay0: mov R7,#250
Delay: mov R6,#61
Delay1: nop
nop
nop
nop
nop
nop
nop
nop
djnz R6,Delay1
djnz R7,Delay
ret
Text: DB '(C) DIGIPOWER 2010'
Text0: DB ' LPC936A1 '
END
And its hex is
:020000000130CD
:1000300075A5FFA20092A3B3920092A411400133D0
:100040007FFA7E3D0000000000000000DEF6DFF2D7
:10005000222843292044494749504F5745522032CE
:0D006000303130204C5043393336413120CF
:00000001FF
Please help, I'm stuck.
Regards,
Dani
I haven't worked with Keil tools for a long time and I never used that micro, so I probably won't be able to help you much.
Did you try running it on the emulator?
Try putting a breakpoint in main and check whether it stops there. There might be some issue with c_start, and your main isn't being called.
Look at the assembly of the initialization code and check for anything odd. I think you can inspect the assembly code generated by the compiler; you might have to turn on an option to generate intermediate files.
You might also check "Electronics and Robotics" at Stack Exchange. There you may find people working with electronics who can provide better help.
You say that you wrote a program in assembly and it works fine, but not in C. Have you verified that your C environment is configured to place your code and data in the correct spots in memory?
Also, some chips have a "reset vector" that is called when the chip is first powered and also when the chip resets. Does your C environment set this vector correctly? Does it put code that will jump to your program when it starts to run?
Disassemble, or compile the C to assembler, to see what the compiler is doing. What is working or not in your C program? Does the LED just glow? Your assembler looks to be burning about 140,000 instructions per delay, but the C maybe 40,000; that could make the difference between a blink you can see with your eyes and an LED that looks to be on but not blinking.
The C program appears to be setting up registers that the assembler does not. Is there a bug there? Are they disabling something that shouldn't be touched?
Bottom line: you need to move the two programs toward each other. Complicate the assembler until it approaches what the C is doing, and adjust the C toward the assembler (you'll have to look at the output of the compiler, though).
Try:
void delay (unsigned long cnt)
{
    while (--cnt) {
        #pragma asm
        NOP
        #pragma endasm
    }
}