Optimising C code for small size - sharing static variables?

I have two functions, both are similar to this:
void Bit_Delay()
{
//this is a tuned tight loop for 8 MHz to generate timings for 9600 baud
volatile char z = 12;
while(z)
{
z++;
z++;
z++;
z++;
z -= 5;
}
}
(The second function is analogous, except that it uses 18 instead of 12 for the counter.)
The code works flawlessly as it is (with z appearing locally to each function internally), but I'm trying to cram a little more functionality into my executable before I hit the (horribly) limited FLASH memory available.
My thought was to promote the z variable to be a global one (a volatile static). Because these two functions are effectively atomic operations (it's a single-threaded CPU and there are no interrupts at play to interfere), I figured that these two functions could share the single variable, thus saving a tiny bit of stack manipulation.
This didn't work. It is clear that the compiler is optimising-out much of the code related to z completely! The code then fails to function properly (running far too fast), and the size of the compiled binary drops to about 50% or so.
I realised that I needed the z variable to be marked volatile to prevent the compiler from removing code it knows is counting a fixed (and thus reducible to a constant) number each time.
Question:
Can I optimise this any further, and trick the compiler into keeping both functions intact? I'm compiling with "-Os" (optimise for small binary).
Here's the entire program verbatim for those playing along at home...
#include <avr/io.h>
#define RX_PIN (1 << PORTB0) //physical pin 3
#define TX_PIN (1 << PORTB1) //physical pin 1
void Bit_Delay()
{
//this is a tuned tight loop for 8 MHz to generate timings for 9600 baud
volatile char z = 12;
while(z)
{
z++;
z++;
z++;
z++;
z -= 5;
}
}
void Serial_TX_Char(char c)
{
char i;
//start bit
PORTB &= ~TX_PIN;
Bit_Delay();
for(i = 0 ; i < 8 ; i++)
{
//output the data bits, LSB first
if(c & 0x01)
PORTB |= TX_PIN;
else
PORTB &= ~TX_PIN;
c >>= 1;
Bit_Delay();
}
//stop bit
PORTB |= TX_PIN;
Bit_Delay();
}
char Serial_RX_Char()
{
char retval = 0;
volatile char z = 18; //1.5 bits delay
//wait for idle high
while((PINB & RX_PIN) == 0)
{}
//wait for start bit falling-edge
while((PINB & RX_PIN) != 0)
{}
//1.5 bits delay
while(z)
{
z++;
z++;
z++;
z++;
z -= 5;
}
for(z = 0 ; z < 8 ; z++)
{
retval >>= 1; //make space for the new bit
retval |= (PINB & RX_PIN) << (8 - RX_PIN); //get the bit and store it
Bit_Delay();
}
return retval;
}
int main(void)
{
CCP = 0xd8; //protection signature for clock registers (see datasheet)
CLKPSR = 0x00; //set the clock prescaler to "div by 1"
DDRB |= TX_PIN;
PORTB |= TX_PIN; //idle high
while (1)
Serial_TX_Char(Serial_RX_Char() ^ 0x20);
}
The target CPU is an Atmel ATtiny5 microcontroller, and the code above uses up 94.1% of the FLASH memory! If you connect to the chip using a serial port at 9600 baud, 8N1, you can type characters in and it returns them with bit 0x20 flipped (uppercase to lowercase and vice versa).
This is not a serious project of course, I'm just experimenting to see how much functionality I could cram into this chip. I'm not going to bother with rewriting this in assembly, I seriously doubt I could do a better job than GCC's optimiser!
EDIT
@Frank asked about the IDE / compiler I'm using...
Microchip Studio (7.0.2542)
The "All Options" string that is passed to the compiler avr-gcc...
-x c -funsigned-char -funsigned-bitfields -DDEBUG -I"C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATtiny_DFP\1.8.332\include" -Os -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -g2 -Wall -mmcu=attiny5 -B "C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATtiny_DFP\1.8.332\gcc\dev\attiny5" -c -std=gnu99 -MD -MP -MF "$(@:%.o=%.d)" -MT"$(@:%.o=%.d)" -MT"$(@:%.o=%.o)"

I question the following assumption:
This didn't work. It is clear that the compiler is optimising-out much of the code related to z completely! The code then fails to function properly (running far too fast), and the size of the compiled binary drops to about 50% or so.
Looking at https://gcc.godbolt.org/z/sKdz3h8oP, it seems like the loops are actually being performed. However, for whatever reason, when using a global volatile z each z++ goes from:
subi r28,lo8(-(1))
sbci r29,hi8(-(1))
ld r20,Y
subi r28,lo8((1))
sbci r29,hi8((1))
subi r20,lo8(-(1))
subi r28,lo8(-(1))
sbci r29,hi8(-(1))
st Y,r20
subi r28,lo8((1))
sbci r29,hi8((1))
to:
lds r20,z
subi r20,lo8(-(1))
sts z,r20
You will need to recalibrate your 12, 18, and 5 constants to get your baud rate correct (since fewer instructions are executed in each loop), but the logic is there in the compiled version.
To be clear: this looks really weird to me; the local volatile version is clearly not being compiled efficiently. I did find an old gcc bug along these lines: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33970, but it does not seem to cover the local variable case.
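As a rough aid for that recalibration, here is a host-side sketch (plain C for a PC, not AVR code; the helper name delay_loop_passes is made up for illustration). It mirrors the arithmetic of the loop body and counts the passes. Each pass nets z - 1, so the number of passes equals the starting constant, and the delay should scale linearly with it as long as the per-pass instruction cost stays fixed.

```c
#include <assert.h>

/* Hypothetical helper: runs the same arithmetic as Bit_Delay() on the host
   and returns how many times the loop body executed. */
static unsigned delay_loop_passes(unsigned char start)
{
    volatile unsigned char z = start;   /* volatile, as in the original */
    unsigned passes = 0;

    while (z) {
        z++; z++; z++; z++;             /* +4 */
        z -= 5;                         /* -5, so net -1 per pass */
        passes++;
    }
    assert(passes == start);            /* the claim made in the lead-in */
    return passes;
}
```

So if the global-variable loop body turns out to cost, say, half as many cycles per pass, roughly doubling 12 and 18 would be the starting point, to be confirmed on a scope.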

Can I optimise this any further,
Of course; code like this is extremely expensive and fragile:
volatile char z = 12;
while(z)
{
z++;
z++;
z++;
z++;
z -= 5;
}
It's expensive because you are asking for code bloat just to waste some cycles. And it's fragile because minimal changes to the code might change the timings. Apart from that, it triggers stack frames because local volatile variables live in the stack frame.
To make things worse, you are using volatile char z as a loop variable!
Why to use _delay_ms and friends
AVR-LibC provides delay routines like _delay_us and _delay_ms in <util/delay.h>. The advantages are:
A specified amount of time is wasted, which is passed as a parameter. (The routines might waste more real time than expected if interrupts are on.)
Code size is minimal due to inline assembly, compiler built-ins like __builtin_avr_delay_cycles and code folding.
No more need for magic numbers like 12 or 18 in your code.
The delay time must evaluate to a compile-time constant, and optimization must be turned on [1]. The canonical use case is to compute the delay time from F_CPU, the baud rate etc. Suppose we want to delay x * 1/BAUD seconds, which is x * 1000000 / BAUD µs.
So let's change Bit_Delay to the following code, where we add -D F_CPU=8000000 to the command line options so that the delay routines have it available:
#define BAUD 9600
__attribute__((always_inline))
static inline void Bit_Delay (double x)
{
_delay_us (x * 1000000 / BAUD);
}
Then use it like Bit_Delay (1). Changing the 4 usages to that, the code size drops from 480 Bytes to 360 Bytes.
Also adjusting the wait for 1.5 bits to Bit_Delay (1.5) and fixing the loop variable to not be volatile, the code size drops to 180 Bytes.
Functions Serial_RX_Char and Serial_TX_Char are each called from just one place, so the compiler can inline them provided we make them static. This reduces the code size further to 170 Bytes, 44 Bytes of which are start-up code and vector table. Moreover, we no longer need stack frames (which were triggered by local volatile vars), and the function calls are inlined, which saves RAM. That matters on a device like the ATtiny5 with just 32 Bytes of RAM.
FYI, the (inlined) delay code is compiled as:
ldi r20,lo8(208)
ldi r21,hi8(208)
1: subi r20,1
sbci r21,0
brne 1b
nop
Magic 208 is basically F_CPU / BAUD / 4 folded by the compiler, where the division by 4 is because one turn of the delay loop takes 4 ticks.
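To make that arithmetic concrete, here is a small host-side sketch (not AVR code; the function name bit_delay_iterations and the half-bit parameter are my own, chosen so that 1.5 bit times can be expressed with integers). It only reproduces the headline figure; the real _delay_us does its rounding in floating point and pads leftover cycles, so treat it as an approximation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: turns of a 4-cycle delay loop that span a number of
   bit times, given in halves (2 = one bit, 3 = 1.5 bits). */
static uint16_t bit_delay_iterations(uint32_t f_cpu, uint32_t baud, uint8_t half_bits)
{
    assert(baud != 0);
    /* CPU cycles in half_bits/2 bit times, truncated toward zero */
    uint64_t cycles = ((uint64_t)f_cpu * half_bits) / (2u * (uint64_t)baud);
    uint64_t iterations = cycles / 4u;   /* one loop turn costs 4 cycles */
    assert(iterations <= UINT16_MAX);    /* the counter is a 16-bit register pair */
    return (uint16_t)iterations;
}
```

With f_cpu = 8000000 and baud = 9600 this yields 208 for one bit time, matching the listing, and 312 for the 1.5-bit wait.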
Why not to use _delay_ms and friends
Busy-waiting is a waste of time and energy. For very short delays it might be in order, but a longer busy-wait may destroy the timing of other parts of the code because it blocks their execution. If possible, use timers + interrupts for that purpose; this saves energy (when sleep can be used while idling) and does not delay execution of unrelated code.
[1] So that code folding works as expected. Otherwise, the delay code will complain:
util/delay.h:112:3: warning: Compiler optimizations disabled; functions from <util/delay.h> won't work as designed

Related

speed up the AVR ISR

I am wondering if it is possible to speed up the ISR without changing the prescaler.
I have a timer with 2 compare registers A and B.
COMPA is used for a PWM output from around 22% up to 100%. This has a fixed frequency and I am not allowed to change it at least not much.
Now I would like to use the COMPB with around 4 times the speed but with a fixed duty cycle of 50%.
If I set the OCIE0B bit in TIMSK0 for the attiny13 can I do the following to speed things up?
Or am I misunderstanding something here?
ISR(TIM0_COMPB_vect){
switch (timing){
case 0:
OCR0B = 63;
PORTB ^= (1 << PB3);
timing = 1;
break;
case 1:
OCR0B = 127;
PORTB ^= (1 << PB3);
timing = 2;
break;
case 2:
OCR0B = 191;
PORTB ^= (1 << PB3);
timing = 3;
break;
case 3:
OCR0B = 255;
PORTB ^= (1 << PB3);
timing = 0;
break;
}
}
Any help appreciated.
Thanx :D
You can do this very efficiently by creatively using the Normal Mode.
The trick is to set the prescaler to get a timer frequency that is double the frequency you want the variable-duty PWM signal to run at. So if, for example, you want that PWM at 1 MHz, set the prescaler for 2 MHz.
Assume the variable duty cycle PWM is on pin A and the fixed 50% 4x clock signal is on pin B. (You can also swap these, update the code accordingly, and everything will still work.)
1. Enable interrupts for "On compare match B" and "Overflow".
2. Force pin A high with a force compare match. (Alternatively, you can skip this step and instead use the inverse of the desired duty cycle in step 7.)
3. Set the COM bits for A to "Toggle on match".
4. Leave the COM bits for B off. This assumes you have DDR set for this pin to be normal GPIO.
5. Set the OCR for B to 128.
6. Set the WGM timer mode to 0, "Normal Mode".
7. Set the OCR for A to whatever you want the variable duty cycle to be. Note that you might need a special case here for the extreme values 0 and/or 255, depending on what you want to have happen (just turn the pin ON or OFF with GPIO).
You can repeat step 7 anytime you want to change the duty cycle of A, and it will update on the next TOP.
Once you do these steps, the A pin will output the desired duty cycle at 1/2 the prescaler clock and the B pin will output 50% duty at 2x the prescaler clock (which is the desired 4x of the A period).
Here is the ISR code (note that I am not sure what the TOV vector is called in the attiny13 headers [it is sometimes different across AVRs] so you might have to edit the TIM0_OVF_vect name)...
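ISR(TIM0_COMPB_vect){
PINB |= (1 << PB3); // Compiles to a single SBI instruction
}
// Reuse the same handler for the overflow vector (ISR takes only one vector name)
ISR(TIM0_OVF_vect, ISR_ALIASOF(TIM0_COMPB_vect));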
See how this works?
Note that setting a bit in the PIN register actually toggles the PORT bit. This is a quirk of the AVR GPIOs that is documented in the datasheets.
Hopefully this is fast enough. If you really want to squeeze every last cycle out, you can even potentially save the 2 cycles of the RJMP from the interrupt vector by putting the single SBI instruction that the ISR compiles down to directly into the interrupt vector table with a trailing RETI, but this is more complicated!
Focusing solely on the C code aspects, then this can be trivially optimized as:
ISR(TIM0_COMPB_vect)
{
static const uint8_t OCR[4] = {63,127,191,255};
OCR0B = OCR[timing];
PORTB ^= 1u << PB3;
timing++;
if(timing==4)
timing=0;
}
Disassembled on gcc AVR -O3 (with all variables/registers volatile) this brings down the amount of instructions from ~50 to ~20, so it's about twice as fast and takes less memory.
If you just want the fastest equivalent version of the supplied code, then here it is...
ISR(TIM0_COMPB_vect){
OCR0B += 64;
PINB |= (1 << PB3);
}
OCR0B will overflow every 4 passes, which is defined behavior. Probably wise to initialize OCR0B to some non-zero number like 1 to avoid edge cases.
This avoids all variables and memory access - only register access.
Avoids all compares and branches.
The PINB method of toggling the pin compiles down to a single SBI instruction rather than a PUSH, load, XOR, store, POP sequence.
...but again, none of this matters if it does not work: unless you are using one of the two "immediate OCR update" modes, updating OCR in the middle of a timer cycle will have no effect.
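The wrap-around that the OCR0B += 64 version relies on can be checked on a PC. The sketch below is host-side only; next_compare and steps_until_repeat are illustrative names, and a plain uint8_t stands in for the OCR0B register. It shows that an 8-bit value advanced by 64 returns to its starting point after exactly four steps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for "OCR0B += 64" on an 8-bit register. */
static uint8_t next_compare(uint8_t ocr)
{
    return (uint8_t)(ocr + 64u);   /* wraps modulo 256 */
}

/* Returns how many steps it takes to come back to the starting value. */
static unsigned steps_until_repeat(uint8_t start)
{
    uint8_t ocr = start;
    unsigned steps = 0;

    do {
        ocr = next_compare(ocr);
        steps++;
    } while (ocr != start);

    assert(steps == 4);            /* 4 * 64 == 256 */
    return steps;
}
```

Starting from 1, the sequence is 65, 129, 193 and back to 1, which is the same spacing as the 63, 127, 191, 255 table in the question, shifted by two counts.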

How to increase brightness or dim the LED using pwm atmega avr

I don't know why, but instead of increasing in brightness, the LED pulses, and the period between pulses keeps getting shorter. This code is copied from a tutorial; in the video it worked fine, but for me it doesn't, even in the simulator. How can that happen?
Using avr 328p.
#define F_CPU 20000000
#include <avr/io.h>
#include <avr/interrupt.h>
#include <util/delay.h>
double dutyCycle = 0;
int main(void)
{
DDRD = (1 << PORTD6);
TCCR0A = (1 << COM0A1) | (1 << WGM00) | (1 << WGM01);
TIMSK0 = (1 << TOIE0);
OCR0A = (dutyCycle/100.0)*255.0;
sei();
TCCR0B = (1 << CS00) | (1 << CS02);
while(1)
{
_delay_ms(100);
dutyCycle += 10;
if(dutyCycle > 100){
dutyCycle = 0;
}
}
}
ISR(TIMER0_OVF_vect){ OCR0A = (dutyCycle/100.0)*255;}
1) If a variable is used both in the main code and in an interrupt, it has to be marked volatile. Then every read or write is compiled as a real access to the corresponding memory cell. Otherwise the compiler may optimize memory accesses away, so a write in the main program may never become visible in the interrupt.
2) Why are you using double? Do not use floating point unless it is strictly necessary. AVR has no hardware support for floating-point arithmetic, so each floating-point operation is expanded into many instructions. In your example, nothing stops you from using an integer variable that runs from 0 to 255. Even if you want a variable with a 0-100 range, you can rescale it using integer arithmetic.
3) Be careful when updating variables that are more than 1 byte long. AVR is an 8-bit architecture, so updating a variable wider than 8 bits takes a series of operations. A double, which is 4 bytes long with avr-gcc's default settings, needs several of them. The interrupt may fire at any moment in the middle of that series, in which case the value seen inside the ISR is only partially updated, with unpredictable results. In the main code, enclose in cli() / sei() any update of a variable that is used inside the ISR and is more than 1 byte wide.
4) Avoid heavy calculations in the ISR. As a rule of thumb, any ISR should complete as soon as possible; all intensive calculations should be placed outside the ISR.
5) In this example, you need no ISR at all! You can write OCR0A directly in the main code.
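The integer rescaling mentioned in point 2 might look like the following sketch. It is plain C that also compiles on a PC; the function name duty_percent_to_ocr is made up for illustration. The intermediate product is at most 100 * 255 = 25500, so 16 bits are enough and no floating point is involved.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: maps a duty cycle of 0..100 percent onto the
   0..255 range of an 8-bit compare register, truncating toward zero. */
static uint8_t duty_percent_to_ocr(uint8_t percent)
{
    assert(percent <= 100);
    return (uint8_t)(((uint16_t)percent * 255u) / 100u);
}
```

On the target, the main loop could then assign the result to OCR0A after each change of the duty cycle, which also removes the need for the shared double.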

Volatile Keyword - MSP430

I'm trying to flash an LED on a TI MSP430 Launchpad board. I have two pieces of code. One works, while the other doesn't. The only difference is the inclusion of the volatile keyword in working version. Why is this keyword needed for the program to execute?
This code works...
void main(void) {
WDTCTL = WDTPW | WDTHOLD; // Stop watchdog timer
// Configure Port Directions
P1DIR |= 0x01; // 0000 0001
volatile unsigned int i;
for(;;)
{
P1OUT ^= 0x01; // Set P1.0 LED on
for (i = 20000; i > 0; i--); // Delay
}
}
While this code does not...
void main(void) {
WDTCTL = WDTPW | WDTHOLD; // Stop watchdog timer
// Configure Port Directions
P1DIR |= 0x01; // 0000 0001
unsigned int i;
for(;;)
{
P1OUT ^= 0x01; // Set P1.0 LED on
for (i = 20000; i > 0; i--); // Delay
}
}
Without volatile, the compiler has a lot more liberty in optimizing out code which it determines does nothing, as well as reordering memory access. Your delay loop is being optimized out when not using volatile.
Neither version is any good; future versions of the compiler may generate vastly different code.
Most MSP430 development tools provide the intrinsic function __delay_cycles(), intended to be used when you want to wait a specific number of cycles.
For example:
#include <intrinsics.h>
void main(void)
{
WDTCTL = WDTPW | WDTHOLD; // Stop watchdog timer
// Configure Port Directions
P1DIR |= 0x01; // 0000 0001
for(;;)
{
P1OUT ^= 0x01; // Set P1.0 LED on
__delay_cycles(40000);
}
}
Note that the code generated for this will execute at full processor speed. If you need a longer delay in a power-restricted environment, please consider using timers and put the processor in low-power mode.
If you add a NOP in your second version in the loop:
for (i = 20000; i > 0; i--) {
asm volatile("nop");
}
it should work as well. In both cases the volatile is needed to prevent optimization. In the first version, it prevents the compiler from completely removing the loop. In the second version, on the asm statement, it tells the compiler to leave the instruction where it is (so it's not moved to another location).
That being said, neither version is considered good style: consider using a timer for exact delays. The loops will not do what you want if the core frequency is changed.
In examining the assembly output of IAR compiler (V4.21.9 for MSP430F5438) the infinite loop is always compiled in, with or without the volatile keyword. (Up to Medium optimization setting.) So this may be a compiler dependency. For sure try compiling with optimizations off.
Where the volatile keyword is important is to tell the compiler not to count on a value and hence reread it. For instance, you could be reading an input buffer receiving an external character. The compiler needs to be told to keep reading, since the buffer is updated by something out of its scope of knowledge.
I prefer a solution that works on every compiler: call a function from another module that is not optimized, or at least not optimized together with the loop. Asm is a good example: a dummy function that just returns.
dummy:
ret
...
void dummy ( unsigned int );
unsigned int ra;
for(ra=0;ra<10000;ra++) dummy(ra);
The compiler can unroll the loop some if it wants but will have to call dummy with the right arguments in the right order, you can use maximum optimization on the C code without worry.
If you don't declare it volatile, many compilers will optimize the accesses away at compile time, hence you may not pick up the changes.

Problems in AVR C combining ADC readings to generate PWM output

I'm writing a program for an ATMega328P that will take readings from several ADC channels, combine them into a single signal and output this signal through PWM.
I've successfully backed off my ADC polling to 50Hz per channel using Single Conversion mode. I'm using Timer/Counter2 for PWM generation, and Timer/Counter1 for doing the calculations I need to do to set compare values for Timer/Counter2. This is the ISR for Timer/Counter1:
// Interrupt service routine called to generate PWM compare values
ISR(TIMER1_COMPA_vect)
{
// Grab most recent ADC reading for ADC0
uint32_t sensor_value_0 = adc_readings[0];
// Get current value for base waveform from wavetable stored in sinewave_data
uint32_t sample_value_0 = pgm_read_byte(&sinewave_data[sample_0]);
// Multiply these two values together
// In other words, use the ADC reading to modulate the amplitude of base wave
uint32_t sine_0 = (sample_value_0 * sensor_value_0) >> 10;
// Do the same thing for ADC2
uint32_t sensor_value_1 = adc_readings[1];
uint32_t sample_value_1 = pgm_read_byte(&sinewave_data[sample_1]);
uint32_t sine_1 = (sample_value_1 * sensor_value_1) >> 10;
// Add channels together, divide by two, set compare register for PWM
OCR2A = (sine_0 + sine_1) >> 1;
// Move successive ADC base waves through wavetable at integral increments
// i.e., ADC0 is carried by a 200Hz sine wave, ADC1 at 300Hz, etc.
sample_0 += 2;
sample_1 += 3;
// Wrap back to front of wavetable, if necessary
if (sample_0 >= sinewave_length) {
sample_0 = 0;
}
if (sample_1 >= sinewave_length) {
sample_1 = 0;
}
} // END - Interrupt service routine called to generate PWM compare values
My problem is that I get no PWM output. If I set either sensor_value_0 or sensor_value_1 to 1024 and leave the other sensor_value_ set to read from the ADC, I do get one full-amplitude component wave and one amplitude-modulated component wave. If, however, I choose a different value for the hardcoded mock amplitude (1023, for instance), I am not so lucky: any other value gives me no PWM output. If I set both sensor_value_s to look at the same ADC channel, I would expect two component waves whose amplitudes are modulated identically. Instead, I get no PWM output. What is most confusing of all to me is that if I choose a value for the hardcoded amplitude that is an exact power of two, all is well.
The whole power-of-two part makes this seem to me to be a bit-twiddling issue that I'm not seeing. Can you see what I must have clearly missed? I'd appreciate any tips at all!
(I've posted my entire source here to keep things as neat as possible on SO.)
Your issue may be caused by the architecture of the AVR that you're developing on. The ATMega328p has 8-bit registers, like most other AVR chips. This means that the 32-bit values you're working with must be split by the compiler across four separate registers every time you perform arithmetic on them. Apart from a few 16-bit helpers such as ADIW, SBIW and MOVW, the arithmetic instructions work on one 8-bit register at a time, so 32-bit operations become multi-instruction sequences or library calls.
I'd be interested to know what the disassembly of your code is, but my guess is that gcc is using the MUL instruction to execute the sample_value_0 * sensor_value_0 code. This instruction operates on two 8-bit values and produces a 16-bit value, so I wouldn't be surprised if that is why you're seeing an odd dependence on powers of two.
I'd say try reworking this block of code by changing the data types of the variables. Use uint8_t for sensor_value_* and sample_value_*, and uint16_t for sine_*. Then, to make sure everything fits in the 8b OCR2A register, change the assignment to something like:
OCR2A = (sine_0 + sine_1) & 0xFF;
@Devrin, I appreciate the response, but just manipulating types didn't do it for me. Here's what I ended up doing:
uint8_t sine_0 = (pgm_read_byte(&sinewave_data[sample_0]) >> 5) * (adc_readings[1] >> 5);
uint8_t sine_1 = (pgm_read_byte(&sinewave_data[sample_1]) >> 5) * (adc_readings[2] >> 5);
OCR2A = (sine_0 >> 1) + (sine_1 >> 1);
Essentially, I've done all my shifting immediately, instead of waiting until the last minute. Unfortunately, I lose a lot of precision, but at least the code works as expected. Now I will begin cranking things back up to find the initial cause of my issues.
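To put a number on the precision lost by shifting early, here is a host-side sketch that compares the two scalings. The function names modulate_full and modulate_early_shift are mine, and the code assumes a 10-bit ADC reading (0..1023) and an 8-bit wavetable sample, as in the question. It says nothing about why the original ISR produced no output; it only compares the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Full-precision variant: 8-bit sample times 10-bit ADC reading, scaled
   back to 8 bits. The product needs up to 18 bits, hence uint32_t. */
static uint8_t modulate_full(uint8_t sample, uint16_t adc)
{
    assert(adc <= 1023);
    return (uint8_t)(((uint32_t)sample * adc) >> 10);
}

/* Early-shift variant from the post: 3-bit sample times 5-bit reading. */
static uint8_t modulate_early_shift(uint8_t sample, uint16_t adc)
{
    assert(adc <= 1023);
    return (uint8_t)((sample >> 5) * (adc >> 5));
}
```

At full scale (sample 255, reading 1023) the first variant gives 254 and the second 217, and any sample below 32 collapses to zero in the early-shift version, so a middle ground such as a 16-bit product of an 8-bit sample and an 8-bit reading may be worth trying.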

cycling through leds

Please help me with this code, it is driving me crazy. This is a very simple program with an 8-bit timer, cycling through all 8 LEDs (one by one). I am using the ATSTK600 board.
My timers are working well, I think there is some problem with the loops (when I debug this program using avr studio-gcc, I can see all the leds working as I want but when I transfer it on board...leds don't blink). Am going crazy with this type of behavior.
Here is my code:
#include <avr/io.h>
#include <avr/interrupt.h>
volatile unsigned int intrs, i, j = 0;
void enable_ports(void);
void delay(void);
extern void __vector_23 (void) __attribute__ ((interrupt));
void enable_ports()
{
DDRB = 0xff;
TCCR0B = 0x03;
TIMSK0 = 0x01;
//TIFR0 = 0x01;
TCNT0 = 0x00;
//OCR0A = 61;
intrs = 0;
}
void __vector_23 (void)
{
for(i = 0; i<=8; i++)
{
while(1)
{
intrs++;
if(intrs >= 61)
{
PORTB = (0xff<<i);
intrs = 0;
break;
}
}
}
PORTB = 0xff;
}
int main(void)
{
enable_ports();
sei();
while(1)
{
}
}
Your interrupt routine is flawed. intrs counts only the number of times the loop has executed, not the number of timer interrupts as its name suggests. 61 iterations of that loop will take very little time. You will see nothing perceivable without an oscilloscope.
The following may be closer to what you need:
void __vector_23 (void)
{
intrs++;
if(intrs > 60)
{
intrs = 0;
PORTB = (0xff<<i);
i++ ;
if(i == 8 )
{
i = 0 ;
PORTB = 0xff;
}
}
}
Although setting the compare register OCR0A to 61 as in your commented out code would avoid the need for the interrupt counter and reduce unnecessary software overhead.
Are you sure that the code downloaded to the board is not optimized?
Have you attached volatile attribute to the PORTB identifier?
Is there a way for you to slow down the code (outside the debugger)? Any chance it's running, but so fast that you don't see it?
Can you verify that your intended code is in fact running (outside the debugger)?
When the interrupt occurs, the handler very quickly counts 61*9 times and finally sets PORTB to 0xff, so the LEDs only flash very briefly, which is not visible. You see it in the simulator just because it runs slower and does not emulate the visual dimming effect of fast port switching. The program has a design flaw: it tries to do the full blinking cycle in a single interrupt. That's wrong; only a single step should be performed per interrupt call. So the handler should look like this:
void __vector_23 (void)
{
intrs++;
if(intrs >= 61)
{
PORTB = (0xff<<i);
intrs = 0;
i++;
if(i>8) i = 0;
}
}
Try this.
There is a guideline on interrupt handlers: an interrupt handler should be as fast and short as possible. Do not perform complex tasks in interrupts (a loop is one of them; if you have a loop in an interrupt, try to remove it). Do not wait or delay in interrupts.
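The "one step per interrupt" idea can be tried out on a PC before flashing. The sketch below is a host-side model of the handler above, not AVR code: the led_state struct and led_on_overflow are illustrative names, and the port field merely stands in for PORTB. Each call represents one timer-overflow interrupt.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t ticks;  /* overflow interrupts seen since the last step */
    uint8_t step;   /* current shift amount, 0..8 */
    uint8_t port;   /* stand-in for PORTB */
} led_state;

/* One call models one timer-overflow interrupt. */
static void led_on_overflow(led_state *s)
{
    assert(s != 0);
    if (++s->ticks < 61)
        return;                       /* not time for the next step yet */
    s->ticks = 0;
    s->port = (uint8_t)(0xFFu << s->step);
    if (++s->step > 8)
        s->step = 0;
}
```

Calling it 61 times advances the pattern by exactly one step, so the visible rate is set by the timer overflow frequency divided by 61 rather than by how fast the CPU can spin through a loop.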
If you're seeing the behaviour you want when debugging with avr studio-gcc, then that gives you some confidence that your program is "good" (for some sense of the word "good"). So it sounds as though you need to focus on a different area: what is the difference between your debug environment and your stand-alone download?
When doing a stand-alone download, do you know if your program is running at all?
Are the LEDs blinking, or turning on at all? You don't explicitly say in your question, but that question could be very relevant to the debugging process. Does it look like the right behaviour, running at a different speed? If so, then your program is probably not doing some sort of initialisation that the debugger was doing.
When doing a stand-alone download, is the program being compiled with different settings compared to the debug version? Perhaps compiler optimisation settings are changing your program's timing characteristics.
(Your question would be better if you gave more detail about what the stand-alone download is doing. In general, it is hard for someone to debug a remote system when they're given few or no details about what is happening. Do all/some of the LEDs turn on at all?)

Resources