ARMCC 5 optimization of strtol and strtod

I have a board based on STM32L4 MCU (Ultra Low Power Cortex-M4) for GNSS tracking purposes. I don't use RTOS, so I use a custom scheduler. Compiler and environment is KEIL uVision 5 (compiler 5.05 and 5.06, behavior doesn't change)
The MCU speaks with GNSS module via plain UART and the protocol is a mix of NMEA and AT. GNSS position is given as plain text that must be converted to a pair of float/double coordinates.
To get the double/float value from text, I use strtod (or strtof).
Note that string operations are made in a separate buffer, different from the UART RX one.
The typical string for a latitude on the UART is
4256.45783
which means 42° 56.45783'.
To get the absolute position in degrees, I use the following formula:
42 + 56.45783 / 60
When there is no optimization, the code works fine and the position is converted correctly. When I turn on level 1 optimization (or higher) and use the standard C library, the integer part (42 in the example) is converted correctly, but converting 56.45783 yields only 56 (the integer part of the minutes, up to the dot).
If I replace the standard library with a custom strtod function downloaded from an ANSI C source library, I simply get 0 with an ERANGE error.
Elsewhere in the code I use strtol, which behaves strangely when level 1 optimization is turned on: when the first digit is 9 and the conversion base is 10, it skips that 9 and continues with the remaining digits.
So if the buffer contains 92, only 2 is parsed. To work around this I prepended a + sign to the number, and the result is always OK (as far as I can tell). This workaround doesn't work with strtod.
Note that I tried to use static, volatile and on-stack variables, behavior doesn't change.
EDIT: I simplified the code in order to get where it goes wrong, as per comments hereafter
C code is like this:
void GnssStringToLatLonDegMin(const char* str, LatLong_t* struc)
{
double dbl = 0.0;
dbl = strtod("56.45783",NULL);
if(struc != NULL)
{
struc->Axis = (float)((dbl / 60.0) + 42.0);
}
}
Level 0 optimization:
559: void GnssStringToLatLonDegMin(const char* str, LatLong_t* struc)
0x08011FEE BDF8 POP {r3-r7,pc}
560: {
0x08011FF0 B570 PUSH {r4-r6,lr}
0x08011FF2 4605 MOV r5,r0
0x08011FF4 ED2D8B06 VPUSH.64 {d8-d10}
0x08011FF8 460C MOV r4,r1
561: double dbl = 0.0;
0x08011FFA ED9F0BF8 VLDR d0,[pc,#0x3E0]
0x08011FFE EEB08A40 VMOV.F32 s16,s0
0x08012002 EEF08A60 VMOV.F32 s17,s1
562: dbl = strtod("56.45783",NULL);
0x08012006 2100 MOVS r1,#0x00
0x08012008 A0F6 ADR r0,{pc}+4 ; #0x080123E4
0x0801200A F7FDFED1 BL.W __hardfp_strtod (0x0800FDB0)
0x0801200E EEB08A40 VMOV.F32 s16,s0
0x08012012 EEF08A60 VMOV.F32 s17,s1
563: if(struc != NULL)
564: {
0x08012016 B1A4 CBZ r4,0x08012042
565: struc->Axis = (float)((dbl / 60.0) + 42.0);
566: }
0x08012018 ED9F0BF5 VLDR d0,[pc,#0x3D4]
0x0801201C EC510B18 VMOV r0,r1,d8
0x08012020 EC532B10 VMOV r2,r3,d0
0x08012024 F7FEF880 BL.W __aeabi_ddiv (0x08010128)
0x08012028 EC410B1A VMOV d10,r0,r1
0x0801202C ED9F0BF2 VLDR d0,[pc,#0x3C8]
0x08012030 EC532B10 VMOV r2,r3,d0
0x08012034 F7FDFFBC BL.W __aeabi_dadd (0x0800FFB0)
0x08012038 EC410B19 VMOV d9,r0,r1
0x0801203C F7FDFF86 BL.W __aeabi_d2f (0x0800FF4C)
0x08012040 6020 STR r0,[r4,#0x00]
567: }
LEVEL 1 optimization
557: void GnssStringToLatLonDegMin(const char* str, LatLong_t* struc)
0x08011FEE BDF8 POP {r3-r7,pc}
558: {
559: double dbl = 0.0;
0x08011FF0 B510 PUSH {r4,lr}
0x08011FF2 460C MOV r4,r1
560: dbl = strtod("56.45783",NULL);
0x08011FF4 2100 MOVS r1,#0x00
0x08011FF6 A0F7 ADR r0,{pc}+2 ; #0x080123D4
0x08011FF8 F7FDFEDA BL.W __hardfp_strtod (0x0800FDB0)
561: if(struc != NULL)
562: {
0x08011FFC 2C00 CMP r4,#0x00
0x08011FFE D010 BEQ 0x08012022
563: struc->Axis = (float)((dbl / 60.0) + 42.0);
564: }
0x08012000 ED9F1BF7 VLDR d1,[pc,#0x3DC]
0x08012004 EC510B10 VMOV r0,r1,d0
0x08012008 EC532B11 VMOV r2,r3,d1
0x0801200C F7FEF88C BL.W __aeabi_ddiv (0x08010128)
0x08012010 ED9F1BF5 VLDR d1,[pc,#0x3D4]
0x08012014 EC532B11 VMOV r2,r3,d1
0x08012018 F7FDFFCA BL.W __aeabi_dadd (0x0800FFB0)
0x0801201C F7FDFF96 BL.W __aeabi_d2f (0x0800FF4C)
0x08012020 6020 STR r0,[r4,#0x00]
565: }
I looked at the disassembly of __hardfp_strtod and __strtod_int, which are called by these functions. As they are linked in as prebuilt binaries, they don't change with the optimization level.

With optimization enabled, strtod didn't work.
Thanks to a suggestion from @old_timer, I wrote my own strtod function, which works even with the optimization level set to 2.
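#include <math.h>
#include <stdint.h>
#include <string.h>
/* Parses an unsigned decimal such as "56.45783". No sign, exponent or whitespace handling. */
double simple_strtod(const char* str)
{
    int8_t inc;
    double result = 0.0;
    const char * c_tmp;
    c_tmp = strchr(str, '.');
    if(c_tmp != NULL)
    {
        /* fractional part: at most 8 digits after the dot */
        c_tmp++;
        inc = -1;
        while(*c_tmp != 0 && inc > -9)
        {
            result += (*c_tmp - '0') * pow(10.0, inc);
            c_tmp++; inc--;
        }
        /* integer part: walk left from the dot without stepping before str */
        inc = 0;
        c_tmp = strchr(str, '.');
        while(c_tmp > str)
        {
            c_tmp--;
            result += (*c_tmp - '0') * pow(10.0, inc);
            inc++;
        }
    }
    return result;
}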
It can be further optimized by not calling 'pow' and use something more clever, but just like this it works perfectly.
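As a minimal sketch of the pow-free idea, assuming the same restricted input as above (unsigned decimal, no exponent): the digits can be accumulated with a multiply by 10, and the scale divided out once at the end. The name simple_strtod_nopow is hypothetical and I have not run this on the target.

```c
#include <stddef.h>

/* Hypothetical pow-free variant: accumulates all digits as one number
   and divides by the fractional scale once at the end. */
double simple_strtod_nopow(const char *str)
{
    double result = 0.0;
    double scale = 1.0;

    if (str == NULL) {
        return 0.0;
    }
    /* integer part */
    while (*str >= '0' && *str <= '9') {
        result = result * 10.0 + (double)(*str - '0');
        str++;
    }
    /* fractional part: keep accumulating, remember the power of ten */
    if (*str == '.') {
        str++;
        while (*str >= '0' && *str <= '9') {
            result = result * 10.0 + (double)(*str - '0');
            scale *= 10.0;
            str++;
        }
    }
    return result / scale;
}
```

Unlike the version above, this one also accepts input without a dot, and it stops at the first character that is not a digit.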

How can I make this loop run faster?

I'm using this code to find the highest temperature pixel in a thermal image and the coordinates of the pixel.
void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
int temp = 0;
for (int i = sz; i > 0; i--)
{
if (returnPixel->temperature < *image)
{
returnPixel->temperature = *image;
temp = i;
}
image++;
}
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}
With an image size of 640x480 it takes around 35ms to run through this function, which is too slow for what I need it for (under 10ms ideally).
This is executing on an ARM A9 processor running Linux.
The compiler I'm using is ARM v8 32-Bit Linux gcc compiler.
I'm using optimize -O3 and the following compile options: -march=armv7-a+neon -mcpu=cortex-a9 -mfpu=neon-fp16 -ftree-vectorize.
This is the output from the compiler:
000127f4 <_findMax>:
for(int i = sz; i > 0; i--)
127f4: e3510000 cmp r1, #0
{
127f8: e52de004 push {lr} ; (str lr, [sp, #-4]!)
for(int i = sz; i > 0; i--)
127fc: da000014 ble 12854 <_findMax+0x60>
12800: e1d2c0b0 ldrh ip, [r2]
12804: e2400002 sub r0, r0, #2
int temp = 0;
12808: e3a0e000 mov lr, #0
if(returnPixel->temperature < *image)
1280c: e1f030b2 ldrh r3, [r0, #2]!
12810: e153000c cmp r3, ip
returnPixel->temperature = *image;
12814: 81a0c003 movhi ip, r3
12818: 81a0e001 movhi lr, r1
1281c: 81c230b0 strhhi r3, [r2]
for(int i = sz; i > 0; i--)
12820: e2511001 subs r1, r1, #1
12824: 1afffff8 bne 1280c <_findMax+0x18>
12828: e30c3ccd movw r3, #52429 ; 0xcccd
1282c: e34c3ccc movt r3, #52428 ; 0xcccc
12830: e0831e93 umull r1, r3, r3, lr
12834: e1a034a3 lsr r3, r3, #9
12838: e0831103 add r1, r3, r3, lsl #2
1283c: e6ff3073 uxth r3, r3
12840: e04ee381 sub lr, lr, r1, lsl #7
12844: e6ffe07e uxth lr, lr
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
12848: e1c2e0b4 strh lr, [r2, #4]
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
1284c: e1c230b6 strh r3, [r2, #6]
}
12850: e49df004 pop {pc} ; (ldr pc, [sp], #4)
for(int i = sz; i > 0; i--)
12854: e3a03000 mov r3, #0
12858: e1a0e003 mov lr, r3
1285c: eafffff9 b 12848 <_findMax+0x54>
For clarity after comments:
Each pixel is an unsigned 16-bit integer. image[0] is the pixel with coordinates 0,0, and the last element in the array has the coordinates 639,479.

This is executing on an ARM A9 processor running Linux.
ARM Cortex-A9 supports Neon.
With this in mind the goal should be to load 8 values (128 bits of pixel data) into a register, then do "compare with the current maximums for each of the 8 places" to get a mask, then use the mask and its inverse to mask out the "too small" old maximums and the "too small" new values; then OR the results to merge the new higher values into the "current maximums for each of the 8 places".
Once that has been done for all pixels (using a loop); you'd want to find the highest value in the "current maximums for each of the 8 places".
However; to find the location of the hottest pixel (rather than just how hot it is) you'd want to split the image into tiles (e.g. maybe 8 pixels wide and 8 pixels tall). This allows you to find the max. temperature within each tile (using Neon); then find the pixel within the hottest tile. Note that for huge images this lends itself to a "multi-layer" approach - e.g. create a smaller image containing the maximum from each tile in the original image; then do the same thing again to create an even smaller image containing the maximum from each "group of tiles", then ...
Making this work in plain C means trying to convince the compiler to auto-vectorize. The alternatives are to use compiler intrinsics or inline assembly. In any of these cases, using Neon to do 8 pixels in parallel (without any branches) could/should improve performance significantly (how much depends on RAM bandwidth).
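To make the mask-and-merge step concrete, here is a rough sketch in portable C rather than Neon, so it compiles anywhere. Eight array slots stand in for the eight 16-bit lanes of one 128-bit register. The names lane_max8 and LANES are made up for illustration, and the pixel count is assumed to be a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* eight 16-bit pixels correspond to one 128-bit register */

/* Returns the maximum of n pixels; n is assumed to be a multiple of LANES. */
uint16_t lane_max8(const uint16_t *image, size_t n)
{
    uint16_t cur[LANES] = {0};   /* current maximum for each of the 8 places */
    uint16_t best;
    size_t i;
    int l;

    for (i = 0; i < n; i += LANES) {
        for (l = 0; l < LANES; l++) {
            uint16_t v = image[i + l];
            /* mask is 0xFFFF where the new value is larger, 0 otherwise */
            uint16_t mask = (uint16_t)-(v > cur[l]);
            /* keep the old maximum where mask is 0, take the new value where it is set */
            cur[l] = (uint16_t)((v & mask) | (cur[l] & (uint16_t)~mask));
        }
    }

    /* final step: highest value among the 8 per-lane maxima */
    best = cur[0];
    for (l = 1; l < LANES; l++) {
        if (cur[l] > best) {
            best = cur[l];
        }
    }
    return best;
}
```

With Neon intrinsics the inner loop body would typically collapse into a single lane-wise maximum such as vmaxq_u16, so no branch depends on the pixel data.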

You should minimize memory access, especially in loops.
Every * or -> could cause unnecessary memory access resulting in serious performance hits.
Local variables are your best friend:
void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
int temp = 0;
uint16_t temperature = returnPixel->temperature;
uint16_t pixel;
for (int i = sz; i > 0; i--)
{
pixel = *image++;
if (temperature < pixel)
{
temperature = pixel;
temp = i;
}
}
returnPixel->temperature = temperature;
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}
Below is how this can be optimized by utilizing neon:
#include <stdint.h>
#include <arm_neon.h>
#include <assert.h>
static inline void findMax128_neon(uint16_t *pDst, uint16x8_t *pImage)
{
uint16x8_t in0, in1, in2, in3, in4, in5, in6, in7, in8, in9, in10, in11, in12, in13, in14, in15;
uint16x4_t dmax;
const uint16_t *p = (const uint16_t *) pImage; // vld1q_u16 expects a pointer to uint16_t
in0 = vld1q_u16(p);
in1 = vld1q_u16(p + 8);
in2 = vld1q_u16(p + 16);
in3 = vld1q_u16(p + 24);
in4 = vld1q_u16(p + 32);
in5 = vld1q_u16(p + 40);
in6 = vld1q_u16(p + 48);
in7 = vld1q_u16(p + 56);
in8 = vld1q_u16(p + 64);
in9 = vld1q_u16(p + 72);
in10 = vld1q_u16(p + 80);
in11 = vld1q_u16(p + 88);
in12 = vld1q_u16(p + 96);
in13 = vld1q_u16(p + 104);
in14 = vld1q_u16(p + 112);
in15 = vld1q_u16(p + 120);
in0 = vmaxq_u16(in1, in0);
in2 = vmaxq_u16(in3, in2);
in4 = vmaxq_u16(in5, in4);
in6 = vmaxq_u16(in7, in6);
in8 = vmaxq_u16(in9, in8);
in10 = vmaxq_u16(in11, in10);
in12 = vmaxq_u16(in13, in12);
in14 = vmaxq_u16(in15, in14);
in0 = vmaxq_u16(in2, in0);
in4 = vmaxq_u16(in6, in4);
in8 = vmaxq_u16(in10, in8);
in12 = vmaxq_u16(in14, in12);
in0 = vmaxq_u16(in4, in0);
in8 = vmaxq_u16(in12, in8);
in0 = vmaxq_u16(in8, in0);
dmax = vmax_u16(vget_high_u16(in0), vget_low_u16(in0));
dmax = vpmax_u16(dmax, dmax);
dmax = vpmax_u16(dmax, dmax);
vst1_lane_u16(pDst, dmax, 0);
}
void _findMax_neon(uint16_t *image, int sz, sPixelData *returnPixel)
{
assert((sz % 128) == 0);
const uint32_t nSector = sz/128;
uint16_t max[nSector];
uint32_t i, s, nMax;
uint16_t *pImage;
for (i = 0; i < nSector; ++i)
{
findMax128_neon(&max[i], (uint16x8_t *) &image[i*128]);
}
s = 0;
nMax = max[0];
for (i = 1; i < nSector; ++i)
{
if (max[i] > nMax)
{
s = i;
nMax = max[i];
}
}
if (nMax < returnPixel->temperature)
{
returnPixel->x_location = 0;
returnPixel->y_location = 0;
return;
}
pImage = &image[128 * s]; // start of the 128-pixel sector that holds the maximum
i = 0;
while(1) {
if (*pImage++ == nMax) break;
i += 1;
}
i += 128 * s;
returnPixel->temperature = nMax;
returnPixel->x_location = i % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = i / IMAGE_HORIZONTAL_SIZE;
}
Beware that the function above assumes sz being a multiple of 128.
And yes, it will run in less than 10ms.

The culprit here is the slow linear search for the highest "temperature". I'm not quite sure how to improve that search algorithm with the information given, if at all possible (could you sort the data in advance?), but you could start with this:
uint16_t max = 0;
size_t found_index = 0;
for(size_t i=0; i<sz; i++)
{
if(max < image[i])
{
max = image[i];
found_index = sz - i - 1; // or whatever makes sense here for the algorithm
}
}
returnPixel->temperature = max;
returnPixel->x_location = found_index % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = found_index / IMAGE_HORIZONTAL_SIZE;
This might give a very slight performance gain because of the top to bottom iteration order and not touching unrelated memory returnPixel in the middle of the loop. max should get stored in a register and with luck you might get slightly better cache performance overall. Still, this comes with a branch like the original code, so it is a minor improvement.
Another micro-optimization is to change the parameter to const uint16_t* image - this might give slightly better pointer aliasing, in case returnPixel happens to contain a uint16_t too. image should be const regardless of performance, since const correctness is always good practice.
Further, more obscure optimization tricks might be possible if you read image 32 or 64 bits at a time, then come up with a fast look-up method to find the largest pixel inside that 32/64 bit chunk.

If you have to find the hottest pixel in the image and there is no structure to the image data itself, then I think you're stuck with iterating through the pixels. If so, you have a number of different ways to make this faster:
- As suggested above, try loop unrolling and other micro-optimisation tricks; this might give you the performance boost you need.
- Go parallel: split the array into N chunks, find the MAX[N] for each chunk and then find the largest of the MAX[N] values. You have to be careful here, as setting up the parallel processes can take longer than doing the work.
If there is some structure to the image, say lots of cold pixels and a hot spot (larger than 1 pixel) that you're trying to find, then there are other techniques you could use.
One approach could be to split the image up into N boxes and then sample each box. The hottest pixel in the hottest box (and maybe in the boxes adjacent to it) would then be your result. However, this depends on there being some structure to the image which you can rely on.

The assembly language reveals the compiler is storing in returnPixel->temperature each time a new maximum is found, with the instruction strhhi r3, [r2]. Eliminate this unnecessary store by caching the maximum in a local object and only updating returnPixel->temperature after the loop ends:
uint16_t maximum = returnPixel->temperature;
for (int i = sz; i > 0; i--)
{
if (maximum < *image)
{
maximum = *image;
temp = i;
}
image++;
}
returnPixel->temperature = maximum;
That is unlikely to reduce the execution time as much as you need, but it might if there is some bad cache or memory interaction occurring. It is a very simple change, so try it before moving on to the SIMD vectorizations suggested in other answers.
Regarding vectorization, two approaches are:
Iterate through the image using a vmax instruction to update the maximum value seen so far in each SIMD lane. Then consolidate the lanes to find the overall maximum. Then iterate through the image again looking for that maximum in any lane. (I forget what that architecture has for instructions that would assist in testing whether a comparison produced true in any lane.)
Iterate through the image maintaining three registers: One with the maximum seen so far in each lane, one with a counter of position in the image, and one with, in each lane, a record of the counter value at the time each new maximum was seen. The first can be updated with vmax, as above. The second can be updated with vadd. The third can be updated with vcmp and vbit. After the loop, figure out which lane has the maximum, and get the position of the maximum from the recorded counter for that lane.
Depending on the performance of the necessary instructions, a hybrid approach may be faster:
Set some strip size S. Partition the image into strips of that size. For each strip in the image, find the maximum (using the fast vmax loop described above). If the maximum is greater than seen in previous strips, remember it and the current strip number. After processing the whole image, the task has been reduced to finding the location of the maximum in a particular strip. Use the second loop from the first approach above for that. (For large images, further refinements may be possible, possibly refining the location using a shorter strip size before finding the exact location, depending on cache behavior and other factors.)
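A scalar sketch of the strip idea, assuming nothing about the target: the first pass only tracks the maximum per strip (this is the part a vmax loop would replace), and the second pass searches a single strip for the position. The function name find_max_index_strips is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the first occurrence of the maximum in image[0..n-1].
   strip must be greater than 0; the last strip may be shorter. */
size_t find_max_index_strips(const uint16_t *image, size_t n, size_t strip)
{
    uint16_t best = 0;
    size_t best_start = 0;
    size_t start, i;

    /* pass 1: maximum of each strip, no position bookkeeping in the inner loop */
    for (start = 0; start < n; start += strip) {
        size_t end = (n - start < strip) ? n : start + strip;
        uint16_t m = 0;
        for (i = start; i < end; i++) {
            if (image[i] > m) {
                m = image[i];
            }
        }
        if (m > best) {          /* strictly greater: the earliest strip wins ties */
            best = m;
            best_start = start;
        }
    }

    /* pass 2: locate the maximum inside the winning strip */
    i = best_start;
    while (i < n && image[i] != best) {
        i++;
    }
    return i;
}
```

The returned index can then be turned into coordinates with index % IMAGE_HORIZONTAL_SIZE and index / IMAGE_HORIZONTAL_SIZE, as in the question.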

In my opinion you cannot improve that algorithm, because any array element can hold the maximum, so you need to do at least one pass through the data. I don't believe you can improve this without going multithreaded. You can start several threads (as many as the cores/processors you have) and give each a subset of your image. Once they are finished, you will have as many local maximum values as the number of threads you started. Just do a second pass over those values to get the overall maximum, and you are finished. But consider the extra workload of creating threads, allocating stack memory for them, and scheduling: if the number of values is low, that can be higher than the workload of running everything in a single thread. If you have a thread pool that can provide ready-to-run threads, then you will probably be able to finish in about one Nth of the time it takes to run the whole loop on a single processor (where N is the number of cores on the machine).
Note: using a dimension that is a power of two will save you the job of calculating a quotient and a remainder, because the division can be done with a bit shift and a bit mask. You use it only once in your function, but it's an improvement anyway.
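For illustration, this is what the shift-and-mask version could look like. It only applies if the rows are stored with a power-of-two stride; 640 is not one, so the 1024-pixel stride below is purely an assumption, and the helper names are hypothetical.

```c
#include <stdint.h>

#define WIDTH_LOG2 10u                    /* assumed padded row stride: 1024 pixels */
#define WIDTH_POW2 (1u << WIDTH_LOG2)

/* column: the low WIDTH_LOG2 bits of the linear index (replaces index % width) */
static inline uint32_t index_to_x(uint32_t index)
{
    return index & (WIDTH_POW2 - 1u);
}

/* row: the remaining high bits (replaces index / width) */
static inline uint32_t index_to_y(uint32_t index)
{
    return index >> WIDTH_LOG2;
}
```

Note that the compiler output in the question already appears to replace the division by the constant width with a multiply and a shift, so the gain from this is probably small.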

Floating point numbers and the effect on 8-bit microcontrollers memory

I am currently working on a project that involves bare-metal programming on an STM8 microcontroller using the SDCC compiler on Linux. The memory in the chip is quite low, so I'm trying to keep things really lean. I have gotten by with using 8-bit and 16-bit variables and things have gone well. But recently I ran into a problem where I really needed a float variable. So I wrote a function that takes in a 16-bit value, converts it to a float, does the math I need and returns an 8-bit number. This caused my final compiled code on the MCU to go from 1198 bytes to 3462 bytes. Now I understand that using floating point is memory intensive and that many functions may need to be called to handle the floating point number, but it seems crazy to increase the size of the program by that much. I would like some help understanding why this is and what happened exactly.
Specs: MCU stm8151f2
Compiler: SDCC with --opt_code_size option
int roundNo(uint16_t bit_input)
{
float num = (((float)bit_input) - ADC_MIN)/124.0;
return num < 0 ? num - 0.5 : num + 0.5;
}

To determine why the code is so large on your particular tool chain, you would need to look at the generated assembly code, and see what FP support calls it makes, then look at the map file to determine the size of each of those functions.
As an example on Godbolt for AVR using GCC 5.4.0 with -Os (Godbolt does not support STM8 or SDCC, so this is for comparison as an 8-bit architecture), your code generates 6364 bytes compared to 4081 bytes for an empty function. So the additional code required for the code body is 2283 bytes. Allowing for the fact that you are using both a different compiler and a different architecture, these are not that different from your results. See in the generated code (below) the rcalls to subroutines such as __divsf3 - these are where the bulk of the code will be, and I suspect FP division is by far the largest contributor.
roundNo(unsigned int):
push r12
push r13
push r14
push r15
mov r22,r24
mov r23,r25
ldi r24,0
ldi r25,0
rcall __floatunsisf
ldi r18,0
ldi r19,0
ldi r20,0
ldi r21,lo8(69)
rcall __subsf3
ldi r18,0
ldi r19,0
ldi r20,lo8(-8)
ldi r21,lo8(66)
rcall __divsf3
mov r12,r22
mov r13,r23
mov r14,r24
mov r15,r25
ldi r18,0
ldi r19,0
ldi r20,0
ldi r21,0
rcall __ltsf2
ldi r18,0
ldi r19,0
ldi r20,0
ldi r21,lo8(63)
sbrs r24,7
rjmp .L6
mov r25,r15
mov r24,r14
mov r23,r13
mov r22,r12
rcall __subsf3
rjmp .L7
.L6:
mov r25,r15
mov r24,r14
mov r23,r13
mov r22,r12
rcall __addsf3
.L7:
rcall __fixsfsi
mov r24,r22
mov r25,r23
pop r15
pop r14
pop r13
pop r12
ret
You need to perform the same analysis on the code generated by your tool chain to answer your question. No doubt SDCC is capable of generating an assembly listing and a map file which will allow you to determine exactly what code and FP support is being generated and linked.
Ultimately though your use of FP in this case is entirely unnecessary:
int roundNo(uint16_t bit_input)
{
int s = (bit_input - ADC_MIN) ;
s += s < 0 ? -62 : 62 ;
return s / 124 ;
}
At Godbolt 2283 bytes compared to an empty function. Still somewhat large, but the issue there most likely is that the AVR lacks a DIV instruction so calls __divmodhi4. STM8 has a DIV for 16 bit dividend and 8 bit divisor, so it will likely be significantly smaller (and faster) on your target.
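If you want to convince yourself that the integer version rounds the same way as the floating-point one, a host-side check along these lines may help. It is only a sketch: ADC_MIN is assumed to be 2048 because the question does not give its value, int is assumed to be at least 32 bits on the host, and the function names are made up.

```c
#include <stdint.h>

#define ADC_MIN 2048  /* assumed value; the question does not state it */

/* Floating-point version from the question. */
static int round_float(uint16_t bit_input)
{
    float num = (((float)bit_input) - ADC_MIN) / 124.0;
    return (int)(num < 0 ? num - 0.5 : num + 0.5);
}

/* Integer version: add half the divisor with matching sign, then truncate. */
static int round_int(uint16_t bit_input)
{
    int s = bit_input - ADC_MIN;
    s += s < 0 ? -62 : 62;
    return s / 124;
}

/* Number of 16-bit inputs for which the two versions disagree. */
static long count_mismatches(void)
{
    long mismatches = 0;
    uint32_t x;
    for (x = 0; x <= 0xFFFFu; x++) {
        if (round_float((uint16_t)x) != round_int((uint16_t)x)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Under those assumptions count_mismatches() should report no differences, since adding 62 before dividing by 124 is the integer equivalent of adding 0.5 after the division.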

OK, a version of fixed point that actually works:
#include <stdint.h>
#include <stdio.h>
// Assume a 28.4 format for math. 12.4 can be used, but roundoff may occur.
// Input should be a literal float. (Note that the multiply here will be handled by the
// compiler and not generate FP asm code.)
// The argument is parenthesized so that an expression such as a - b is scaled as a whole.
#define TO_FIXED(x) (int)((x) * 16)
// Takes a fixed and converts to an int - should turn into a right shift 4.
#define TO_INT(x) (int)((x) / 16)
typedef int FIXED;
const uint16_t ADC_MIN = 32768;
int roundNo(uint16_t bit_input)
{
FIXED num = (TO_FIXED(bit_input - ADC_MIN)) / 124;
num += num < 0 ? TO_FIXED(-0.5) : TO_FIXED(0.5);
return TO_INT(num);
}
int main()
{
printf("%d", roundNo(0));
return 0;
}
Note we are using some 32-bit values here so it will be bigger than your current values. With care though, it could possibly convert back to a 12.4 (16-bit int) instead if round off and overflow can be managed carefully.
Or go grab a better full feature Fixed Point library from the web :)

(Update) After writing this, I noticed that @Clifford mentioned that your microcontroller supports the DIV instruction natively, in which case doing this is redundant. Anyway, I will leave it as a concept which can be applied in cases where DIV is implemented as an extern call, or for cases where DIV takes too many cycles and the goal is to make the calculation faster.
Anyway, shifting and adding is likely to be faster than division, if you ever need to squeeze some extra cycles. So if you start from the fact that 124 is almost equal to 4096/33 (the error factor is 0.00098, i.e. 0.098%, so less than 1 in 1000), you can implement the division with a single multiplication with 33 and a shift by 12 bits (division by 4096). Furthermore, 33 is 32+1, meaning multiplying by 33 is equal to shifting left by 5 and adding the input again.
Example: you want to divide 5000 by 124, and 5000/124 is approx. 40.323. What we will be doing is:
5,000 << 5 = 160,000
160,000 + 5,000 = 165,000
165,000 >> 12 = 40
Note that this only works for positive numbers. Also note that, if you're really doing lots of multiplications all over the code, then having a single extern mul or div function might result in smaller overall code in the long run, especially if the compiler is not particularly good at optimizing. And if the compiler can just emit a DIV instruction here, then the only thing you can get is a tiny bit of speed improvement, so don't bother with this.
#include <stdint.h>
#define ADC_MIN 2048
uint16_t roundNo(uint16_t bit_input)
{
// input too low, return zero
if (bit_input < ADC_MIN)
return 0;
bit_input -= (ADC_MIN - 62);
uint32_t x = bit_input;
// this gets us x = x * 33
x <<= 5;
x += bit_input;
// this gets us x = x / 4096
x >>= 12;
return (uint16_t)x;
}
GCC AVR with size optimizations produces this, i.e. all calls to extern mul or div functions are gone, but it seems like AVR doesn't support shifting multiple bits in a single instruction (it emits loops which shift 5 times and 12 times respectively). I don't have a clue what your compiler will do.
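To see how far the multiply-by-33, shift-by-12 approximation can drift from a true division by 124, a small host-side check like the following can be used. This is a sketch with hypothetical function names, not something I ran on the target.

```c
#include <stdint.h>

/* x / 124 approximated as (x * 33) >> 12, i.e. x * 33 / 4096 */
static uint32_t div124_approx(uint32_t x)
{
    return ((x << 5) + x) >> 12;
}

/* Largest absolute difference between x / 124 and the approximation for x in [0, limit]. */
static uint32_t max_div124_error(uint32_t limit)
{
    uint32_t worst = 0;
    uint32_t x;
    for (x = 0; x <= limit; x++) {
        uint32_t exact = x / 124u;
        uint32_t approx = div124_approx(x);
        uint32_t err = exact > approx ? exact - approx : approx - exact;
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}
```

Because 33/4096 is slightly smaller than 1/124, the approximation never overshoots, but it can come out one too low right at a boundary. For instance, 124 * 33 = 4092, which is just below 4096, so an input of exactly 124 yields 0 instead of 1. Whether that matters depends on how exact the rounding has to be.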
If you also need to handle the bit_input < ADC_MIN case, I would handle this part separately, i.e.:
#include <stdint.h>
#include <stdbool.h>
#define ADC_MIN 2048
int16_t roundNo(uint16_t bit_input)
{
// if subtraction would result in a negative value,
// handle it properly
bool negative = (bit_input < ADC_MIN);
bit_input = negative ? (ADC_MIN - bit_input) : (bit_input - ADC_MIN);
// we are always positive from this point on
// ADC_MIN has already been subtracted above, so only add the rounding offset (half of 124)
bit_input += 62;
uint32_t x = bit_input;
x <<= 5;
x += bit_input;
x >>= 12;
return negative ? -(int16_t)x : (int16_t)x;
}

Why does storing a %x value to a variable do funky things when leading value is 8-f?

I am writing a program in c which imitates an LC-3 simulator. One of the objectives of this program is to store a 4 digit hexadecimal value from a file (0000 - ffff), and to convert it to binary, and interpret an LC-3 instruction from it. The following code segment shows how I am storing this value into a variable (which is where the problem seems to lie), and below that is the output I am receiving:
int *strstr(int s, char c);
void initialize_memory(int argc, char *argv[], CPU *cpu) {
FILE *datafile = get_datafile(argc, argv);
// Buffer to read next line of text into
#define DATA_BUFFER_LEN 256
char buffer[DATA_BUFFER_LEN];
int counter = 0;
// Will read the next line (words_read = 1 if it started
// with a memory value). Will set memory location loc to
// value_read
//
int value_read, words_read, loc = 0, done = 0;
char comment;
char *read_success; // NULL if reading in a line fails.
int commentLine =0;
read_success = fgets(buffer, DATA_BUFFER_LEN, datafile);
while (read_success != NULL && !done) {
// If the line of input begins with an integer, treat
// it as the memory value to read in. Ignore junk
// after the number and ignore blank lines and lines
// that don't begin with a number.
//
words_read = sscanf(buffer, "%04x%c", &value_read, &comment);
// if an integer was actually read in, then
// set memory value at current location to
// value_read and increment location. Exceptions: If
// loc is out of range, complain and quit the loop. If
// value_read is outside 0000 and ffff, then it's a
// sentinel -- we should say so and quit the loop.
if (value_read == NULL || comment ==';')
{
commentLine = 1;
}
if (value_read < -65536 || value_read > 65536)
{
printf("Sentinel read in place of Memory location %d: quitting loop\n", loc);
break;
}
else if (value_read >= -65536 && value_read <= 65536)
{
if (commentLine == 0)
{
if (counter == 0)
{
loc = value_read;
cpu -> memLocation = loc;
printf("\nPC location set to: x%04x \n\n", cpu -> memLocation);
counter++;
}
else
{
cpu -> mem[loc] = value_read;
printf("x%04x : x%d\t %04x \t ", loc,loc, cpu -> mem[loc]);
print_instr(cpu, cpu -> mem[loc]);
loc++;
value_read = NULL;
}
}
}
if (loc > 65536)
{
printf("Reached Memory limit, quitting loop.\n", loc);
break;
}
commentLine = 0;
read_success = fgets(buffer, DATA_BUFFER_LEN, datafile);
// Gets next line and continues the loop
}
fclose(datafile);
// Initialize rest of memory
while (loc < MEMLEN) {
cpu -> mem[loc++] = 0;
}
}
My aim is to show the Hex address : decimal address, the hex instruction, binary code, and then at the end, its LC-3 instruction translation. The data I am scanning from the file is the hex instruction:
x1000 : x4096 200c 0010000000001100 LD, R0, 12
x1001 : x4097 1221 0001001000100000 ADD, R1, R0, 0
x1002 : x4098 1401 0001010000000000 ADD, R2, R0, R0
x1003 : x4099 ffff94bf 0000000000000000 NOP
x1004 : x4100 166f 0001011001101110 ADD, R3, R1, 14
x1005 : x4101 1830 0001100000110000 ADD, R4, R0, -16
x1006 : x4102 1b04 0001101100000100 ADD, R5, R4, R4
x1007 : x4103 5d05 0101110100000100 AND, R6, R4, R4
x1008 : x4104 5e3f 0101111000111110 AND, R7, R0, -2
x1009 : x4105 5030 0101000000110000 AND, R0, R0, -16
x100a : x4106 52ef 0101001011101110 AND, R1, R3, 14
x100b : x4107 5fe0 0101111111100000 AND, R7, R7, 0
x100c : x4108 fffff025 0000000000000000 NOP
x100d : x4109 7fff 0111111111111110 STR, R7, R7, -2
As you can see, my problem lies in addresses x1003 and x100c;
As stated in the headline, when storing the hex instruction, if the value is between 8 and f, my best guess is that the scan is interpreting it as a negative value because of the leading value of the first hex digit in binary. If that is the case, it makes perfect sense, but is there a way I can bypass this? And if it isn't the case, what else could be causing this?
I found that if I pass value_read into print_instr() instead of cpu -> mem[loc], then the output works correctly. However, this is only a temporary fix as I need to store that value for later use in the program(for actual execution of the instruction). So the problem seems to arise while storing, and I am unsure as to why.
Additionally, (and this is a side question) though it is not a pressing concern, since I am using %x%c (value_read, comment) to store values from the file, I have been having trouble with the first few lines of the .hex file I am using, in which there is no hex value in the line, but instead just a comment symbol (for those unfamiliar with lc_3 simulators, the ';' is the symbol for comments). Whenever this is the case, I get a hex value of zero, although I wish for it to be NULL(In my program, I implemented a temporary solution because I am not sure how to fix it). I am not an expert in c just yet, and have not been able to find a solution to this problem. If you can help, it would be greatly appreciated, otherwise, it isn't a big issue for what I am trying to achieve with this program, it is more so just for my own knowledge and growth.
Thank you all in advance for your help :)

In a scanf family format string, the %x specifier means to read into an unsigned int. The corresponding argument must have exactly the type unsigned int *.
However you supply an argument of type int *.
This causes undefined behaviour. What you are seeing is the chance interaction between library elements that expect you to follow the rules, and your code that didn't follow the rules.
To fix it, follow the rules. For example, read into an unsigned int variable.
NB. 0 does nothing in the scanf format string; %04x is equivalent to %4x.
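A minimal sketch of what following the rules could look like. The helper name parse_hex_word is hypothetical; the point is to read into an unsigned int and to check the return value of sscanf, which also tells you when a line (for example one that starts with ';') contains no hex value at all.

```c
#include <stdio.h>

/* Parses up to four hex digits at the start of a line.
   Returns 1 and stores the value on success.
   Returns 0 and leaves *value untouched if the line does not start with a hex number. */
int parse_hex_word(const char *line, unsigned int *value)
{
    unsigned int tmp;   /* %x requires a pointer to unsigned int */

    if (sscanf(line, "%4x", &tmp) != 1) {
        return 0;       /* comment line, blank line or other junk */
    }
    *value = tmp;
    return 1;
}
```

With this, the caller decides what "no value" means from the return code instead of comparing the parsed integer against NULL.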

May I suppose that cpu->mem is an array of short or a similar type? Then sign extension occurs when printing cpu->mem[loc]. Remember that integer arguments narrower than int are promoted to int in printf calls. The symptom is the same as in the following code:
int i;
scanf("%4x",&i);
printf("%x\n",i);
short s = i;
printf("--> %x\n",s);
If the input is ffff, the short holds -1; when it is promoted to int it is still -1, which prints as 0xffffffff (if int is 32 bits).
Use unsigned short instead.
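The difference between the two element types can be shown without printf at all. The sketch below assumes a two's complement machine and uses fixed-width types to stand in for short and int; the function names are only for illustration.

```c
#include <stdint.h>

/* Widening a 16-bit memory word stored in a signed type:
   values with the top bit set become negative, and the upper bits fill with 1s. */
static uint32_t widen_signed(uint16_t word)
{
    int16_t s = (int16_t)word;
    return (uint32_t)(int32_t)s;
}

/* Widening the same word stored in an unsigned type: the value is preserved. */
static uint32_t widen_unsigned(uint16_t word)
{
    return (uint32_t)word;
}
```

Words whose first hex digit is 0 to 7 come out identical either way, which would explain why only values such as 94bf and f025 look wrong in the output.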

writing a function in ARM assembly language which inserts a string into another string at a specific location

I was going through one of my class's textbooks and I stumbled upon this problem:
Write a function in ARM assembly language which will insert a string into another string at a specific location. The function is:
char * csinsert( char * s1, char * s2, int loc ) ;
The function has a pointer to s1 in a1, a pointer to s2 in a2, and an integer in a3 as to where the insertion takes place. The function returns a pointer in a1 to the new string.
You can use the library functions strlen and malloc.
strlen has as input the pointer to the string in a1 and returns the length in a1.
malloc will allocate space for the new string where a1 on input is the size in bytes of the space requested and on output a1 is a pointer to the requested space.
Remember the registers a1-a4 do not retain their values across function calls.
This is the C language driver for the string insert I created:
#include <stdio.h>
extern char * csinsert( char * s1, char * s2, int loc ) ;
int main( int argc, char * argv[] )
{
char * s1 = "String 1 are combined" ;
char * s2 = " and string 2 " ;
int loc = 8 ;
char * result ;
result = csinsert( s1, s2, loc ) ;
printf( "Result: %s\n", result ) ;
}
My assembly language code so far is:
.global csinsert
csinsert:
stmfd sp!, {v1-v6, lr}
mov v1, a1
bl strlen
add a1, a1, #1
mov v2, a1
add a2, a2
mov v3, a2
add a3, a3
bl malloc
mov v3, #0
loop:
ldrb v4, [v1], #1
subs v2, v2, #1
add v4, v4, a2
strb v4, [a1], #1
bne loop
ldmfd sp!, {v1-v6, pc} #std
.end
I don't think my code works properly. When I link the two files, no result is printed. Why does my code not insert the string properly? I believe the issue is in the assembly program; is it not returning anything?
Can anyone explain what my mistake is? I'm not sure how to use the library functions the question hints at.
Thanks!

Caveat: I've been doing asm for 40+ years. I've looked at ARM a bit, but not used it. However, I pulled the ARM ABI document.
As the problem stated, a1-a4 are not preserved across a call, which matches the ABI. You saved your a1, but you did not save your a2 or a3.
strlen [or any other function] is permitted to use a1-a4 as scratch regs. So, for efficiency, my guess is that strlen [or malloc] is using a2-a4 as scratch and [from your perspective] corrupting some of the register values.
By the time you get to loop:, a2 is probably a bogus journey :-)
UPDATE
I started to clean up your asm. Style is 10x more important in asm than C. Every asm line should have a sidebar comment. And add a blank line here or there. Because you didn't post your updated code, I had to guess at the changes and after a bit, I realized you only had about 25% or so. Plus, I started to mess things up.
I split the problem into three parts:
- Code in C
- Take C code and generate arm pseudo code in C
- Code in asm
If you take a look at the C code and pseudo code, you'll notice that any misuse of instructions aside, your logic was wrong (e.g. you needed two strlen calls before the malloc)
So, here is your assembler cleaned for style [not much new code]. Notice that I may have broken some of your existing logic, but my version may be easier on the eyes. I used tabs to separate things and got everything to line up. That can help. Also, the comments show intent or note limitations of instructions or architecture.
.global csinsert
csinsert:
stmfd sp!,{v1-v6,lr} // preserve caller registers
// preserve our arguments across calls
mov v1,a1
mov v2,a2
mov v3,a3
// get length of destination string
mov a1,v1 // set dest addr as strlen arg
bl strlen // call strlen
add a1,a1,#1 // increment length
mov v4,a1 // save it
add v3,v3 // src = src + src (what???)
mov v5,v2 // save it
add v3,v3 // double the offset (what???)
bl malloc // get heap memory
mov v4,#0 // set index for loop
loop:
ldrb v7,[v1],#1
subs v2,v2,#1
add v7,v7,a2
strb v7,[a1],#1
bne loop
ldmfd sp!,{v1-v6,pc} #std // restore caller registers
.end
First, you should prototype in real C:
// csinsert_real -- real C code
#include <stdlib.h> // malloc
#include <string.h> // strlen
char *
csinsert_real(char *s1,char *s2,int loc)
{
int s1len;
int s2len;
char *bp;
int chr;
char *bf;
s1len = strlen(s1);
s2len = strlen(s2);
bf = malloc(s1len + s2len + 1);
bp = bf;
// copy over s1 up to but not including the "insertion" point
for (; loc > 0; --loc, ++s1, ++bp) {
chr = *s1;
if (chr == 0)
break;
*bp = chr;
}
// "insert" the s2 string
for (chr = *s2++; chr != 0; chr = *s2++, ++bp)
*bp = chr;
// copy the remainder of s1 [if any]
for (chr = *s1++; chr != 0; chr = *s1++, ++bp)
*bp = chr;
*bp = 0;
return bf;
}
Then, you can [until you're comfortable with arm], prototype in C "pseudocode":
// csinsert_pseudo -- pseudo arm code
char *
csinsert_pseudo()
{
// save caller registers
v1 = a1;
v2 = a2;
v3 = a3;
a1 = v1;
strlen();
v4 = a1;
a1 = v2;
strlen();
a1 = a1 + v4 + 1;
malloc();
v5 = a1;
// NOTE: load/store may only use r0-r7
// and a1 is r0
#if 0
r0 = a1;
#endif
r1 = v1;
r2 = v2;
// copy over s1 up to but not including the "insertion" point
loop1:
if (v3 == 0) goto eloop1;
r3 = *r1;
if (r3 == 0) goto eloop1;
*r0 = r3;
++r0;
++r1;
--v3;
goto loop1;
eloop1:
// "insert" the s2 string
loop2:
r3 = *r2;
if (r3 == 0) goto eloop2;
*r0 = r3;
++r0;
++r2;
goto loop2;
eloop2:
// copy the remainder of s1 [if any]
loop3:
r3 = *r1;
if (r3 == 0) goto eloop3;
*r0 = r3;
++r0;
++r1;
goto loop3;
eloop3:
*r0 = 0;
a1 = v5;
// restore caller registers
}
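When testing the finished assembly, it can help to have an independent known-good result to compare against. The following is a rough reference sketch in C (not the assembly the exercise asks for); csinsert_ref is a made-up name, and clamping an out-of-range loc is an assumption, chosen because csinsert_real above effectively behaves the same way:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference version: returns a malloc'd copy of s1 with s2
   inserted at index loc, or NULL if allocation fails. */
char *csinsert_ref(const char *s1, const char *s2, int loc)
{
    size_t len1, len2, at;
    char *out;

    assert(s1 != NULL && s2 != NULL);
    len1 = strlen(s1);
    len2 = strlen(s2);
    /* Clamp the insertion point to [0, len1]; this is an assumption, since
       the exercise does not say what to do with an out-of-range loc. */
    at = (loc < 0) ? 0 : ((size_t)loc > len1 ? len1 : (size_t)loc);

    out = malloc(len1 + len2 + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, s1, at);                             /* head of s1 */
    memcpy(out + at, s2, len2);                      /* inserted string */
    memcpy(out + at + len2, s1 + at, len1 - at + 1); /* tail of s1 plus NUL */
    return out;
}
```

Calling both this and the assembly csinsert from the C driver with the same arguments and comparing the results with strcmp is one way to catch off-by-one errors around the insertion point.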

Smallest method of turning a string into an integer (and vice-versa)

I am looking for an extremely small way of turning a string like "123" into an integer like 123 and vice-versa.
I will be working in a freestanding environment. This is NOT a premature optimization. I am creating code that must fit in 512 bytes, so every byte does actually count. I will take both x86 assembly (16-bit) and C code, though (as that is pretty easy to convert).
It does not need to do any sanity checks or anything.
I thought I had seen a very small C implementation written recursively, but I can't seem to find anything optimized for size.
So can anyone find (or create) a very small atoi/itoa implementation for me? It only needs to work with base 10.
Edit: (the answer) (edited again because the first code was actually wrong)
In case someone else comes upon this, this is the code I ended up creating. It could fit in 21 bytes!
;ds:bx is the input string. ax is the returned integer
_strtoint:
xor ax,ax
.loop1:
imul ax, 10 ;ax serves as our temp var
mov cl,[bx]
mov ch,0
add ax,cx
sub ax,'0'
inc bx
cmp byte [bx],0
jnz .loop1
ret
Ok, last edit I swear!
Version weighing in at 42 bytes with negative number support, so if anyone wants to use these they can.
;ds:bx is the input string. ax is the returned integer
_strtoint:
cmp byte [bx],'-'
je .negate
;rewrite to negate DX(just throw it away)
mov byte [.rewrite+1],0xDA
jmp .continue
.negate:
mov byte [.rewrite+1],0xD8
inc bx
.continue:
xor ax,ax
.loop1:
imul ax, 10 ;ax serves as our temp var
mov dl,[bx]
mov dh,0
add ax,dx
sub ax,'0'
inc bx
cmp byte [bx],0
jnz .loop1
;popa
.rewrite:
neg ax ;this instruction gets rewritten to conditionally negate ax or dx
ret

With no error checking, 'cause that's for wussies who have more than 512B to play with:
#include <ctype.h>
// alternative:
// #define isdigit(C) ((C) >= '0' && (C) <= '9')
unsigned long myatol(const char *s) {
unsigned long n = 0;
while (isdigit(*s)) n = 10 * n + *s++ - '0';
return n;
}
gcc -O2 compiles this into 47 bytes, but the external reference to __ctype_b_loc is probably more than you can afford...

I don't have an assembler on my laptop to check the size, but offhand, it seems like this should be shorter:
; input: zero-terminated string in DS:SI
; result: AX
atoi proc
xor cx, cx
mov ax, '0'
@@:
imul cx, 10
sub al, '0'
add cx, ax
lodsb
test al, al ; lodsb does not set flags, so test for the terminator explicitly
jnz @b
xchg ax, cx
ret
atoi endp

Write it yourself. Note that subtracting '0' from a digit character gives that digit's numeric value. So you loop over the digits, and for each one you multiply the value so far by 10, subtract '0' from the current character, and add the result. Codable in assembly in no time flat.
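That loop, plus the reverse conversion the question also asks for, might look like this in C. This is only a sketch: the names are made up, it handles unsigned base-10 values only, and the assert calls are there for illustration and would be dropped in a 512-byte build.

```c
#include <assert.h>

/* Hypothetical names. No validation: the input must be decimal digits only. */
static unsigned str_to_uint(const char *s)
{
    unsigned n = 0;

    assert(s != 0);
    while (*s)
        n = n * 10 + (unsigned)(*s++ - '0'); /* shift one decimal place, add digit */
    return n;
}

/* Writes the decimal form of n into buf (needs room for the digits plus NUL)
   and returns a pointer to the terminating NUL. The recursion emits the most
   significant digit first. */
static char *uint_to_str(unsigned n, char *buf)
{
    assert(buf != 0);
    if (n >= 10)
        buf = uint_to_str(n / 10, buf);
    *buf++ = (char)('0' + n % 10);
    *buf = '\0';
    return buf;
}
```

The recursive form avoids a separate digit-reversal pass, at the cost of one stack frame per digit.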

atoi(p)
register char *p;
{
register int n;
register int f;
n = 0;
f = 0;
for(;;p++) {
switch(*p) {
case ' ':
case '\t':
continue;
case '-':
f++;
case '+':
p++;
}
break;
}
while(*p >= '0' && *p <= '9')
n = n*10 + *p++ - '0';
return(f? -n: n);
}

And here is another one without any checking. It assumes a null-terminated string. As a bonus, it checks for a negative sign. This takes 593 bytes with a Microsoft compiler (cl /O1).
int myatoi( char* a )
{
int res = 0;
int neg = 0;
if ( *a == '-' )
{
neg = 1;
a++;
}
while ( *a )
{
res = res * 10 + ( *a - '0' );
a++;
}
if ( neg )
res *= -1;
return res;
}

Are any of the sizes smaller if you use -Os (optimize for space) instead of -O2?

You could try packing the string into BCD (0x1234) and then using the x87 fbld and fist instructions for a 1980s solution, but I am not sure that will be smaller at all, as I don't remember there being any packing instruction.

How in the world are you people getting the executables so small?! This code generates a 316-byte .o file when compiled with gcc -Os -m32 -c -o atoi.o atoi.c and an 8488-byte executable when compiled and linked (with an empty int main(){} added) with gcc -Os -m32 -o atoi atoi.c. This is on Mac OS X Snow Leopard...
int myatoi(char *s)
{
short retval=0;
for(;*s!=0;s++) retval=retval*10+(*s-'0');
return retval;
}