Pointer and regular variable with the same name

In the "Memory and pointers" chapter of the book "Embedded C programming", Mark Siegesmund gives the following example:
void Find_Min_Max( int list[], int count, int * min, int * max){
for( int i=0, min=255, max=0; i<count; i++){
if( *min>list[i])
*min=list[i];
if( *max<list[i] )
*max=list[i];
}
}
// call like this: Find_Min_Max( table, sizeof(table), &lowest, &highest);
If I understand correctly:
table is an array of int,
count is the size of the array
when calling, &lowest and &highest are addresses of int variables where one wants to store the results
int * min and int * max in the function definition refer to pointers * min and * max, both to integer types
the &lowest and &highest as defined in his last line are actually pointers to an address with an int type variable
(Not 100% sure about that last one.)
Inside the for loop, each iteration compares the next int in the array list with the values at the addresses that min and max point to, and updates the values at those addresses if necessary.
But in the definition of the loop, he defines min = 255, max = 0.
It seems to me as if these are two completely new variables that have not been initialized.
Should that line not be
for( int i=0, *min=255, *max=0; i<count; i++){
Is that a mistake in the book or something I misunderstand?

It seems like an error in the book - it does indeed declare new variables inside the loop. (Hint: before publishing programming books, at least compile the code first...)
But even with that embarrassing bug fixed, the code is naively written. Here are more mistakes:
Always const qualify array parameters that aren't modified.
Always use stdint.h in embedded systems.
Never use magic numbers like 255. In this case, use UINT8_MAX instead.
The above is industry standard consensus. (Also required by MISRA-C etc.)
Also, it is more correct to use size_t rather than int for array sizes, but that's more of a style remark.
In addition, a better algorithm is to have the pointers point at the min and max values found in the array, meaning we don't just get the values but also their locations in the data container. Finding the location is a very common use-case. It's about the same execution speed but we get more information.
So if we attempt to rewrite this as book-worthy code, it would look more like:
void find_min_max (const uint8_t* data, size_t size, const uint8_t** min, const uint8_t** max);
Bit harder to read and use with pointer-to-pointers, but more powerful.
(Normally we would micro-optimize pointers of same type with restrict, but in this case all pointers may end up pointing at the same object, so it isn't possible.)
Complete example:
#include <stddef.h>
#include <stdint.h>
void find_min_max (const uint8_t* data, size_t size, const uint8_t** min, const uint8_t** max)
{
*min = data;
*max = data;
for(size_t i=0; i<size; i++)
{
if(**min > data[i])
{
*min = &data[i];
}
if(**max < data[i])
{
*max = &data[i];
}
}
}
Usage example for PC: (please note that int main (void) and stdio.h shouldn't be used in embedded systems.)
#include <stdio.h>
#include <inttypes.h>
int main (void)
{
const uint8_t data[] = { 1, 2, 3, 4, 5, 4, 3, 2, 1, 0};
const uint8_t* min;
const uint8_t* max;
find_min_max(data, sizeof data, &min, &max);
printf("Min: %"PRIu8 ", index: %d\n", *min, (int)(min-data));
printf("Max: %"PRIu8 ", index: %d\n", *max, (int)(max-data));
return 0;
}
Disassembling this search algorithm for ARM gcc -O3:
find_min_max:
cmp r1, #0
str r0, [r2]
str r0, [r3]
bxeq lr
push {r4, lr}
add r1, r0, r1
.L5:
mov lr, r0
ldr ip, [r2]
ldrb r4, [ip] # zero_extendqisi2
ldrb ip, [r0], #1 # zero_extendqisi2
cmp r4, ip
strhi lr, [r2]
ldr r4, [r3]
ldrbhi ip, [r0, #-1] # zero_extendqisi2
ldrb r4, [r4] # zero_extendqisi2
cmp r4, ip
strcc lr, [r3]
cmp r1, r0
bne .L5
pop {r4, pc}
This is still not the most efficient code; it is very branch-intensive. I think there's plenty of room for optimizing this further, if the aim is library-quality code. But then it's also a specialized algorithm, finding both min and max, and their respective indices.
For small data sets it would probably be wiser to just sort the data first, then just grab min and max out of the sorted smallest and largest indices. If you plan to search through the data for other purposes elsewhere in the code, then definitely sort it first, so that you can use binary search.

int i=0, min=255, max=0 and int i=0, *min=255, *max=0 both define three new variables, which are initialized, but used incorrectly in the loop body.
The limits should be initialized before the loop:
*min=255;
*max=0;
for(int i=0; i<count; i++)
Alternatively the new variable i could be defined before the loop, but this is not as easy to read as the first one:
int i;
for(i=0, *min=255, *max=0; i<count; i++)
Note that if there are values smaller than 0 or larger than 255, the returned minimum and maximum values will be incorrect.
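As a minimal sketch of the fix (not the book's code): the hypothetical find_min_max_int below assigns through the pointers before the loop and seeds both results from the first element instead of 255 and 0, so values outside 0..255 are handled as well. It assumes count is the number of elements (not the size of the array in bytes) and that the array is not empty.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical corrected variant; the name and the size_t parameter are
   illustrative choices, not taken from the book. */
void find_min_max_int(const int list[], size_t count, int *min, int *max)
{
    assert(count > 0);          /* assumption: the caller passes a non-empty array */

    *min = list[0];             /* write through the pointers, no new variables */
    *max = list[0];

    for (size_t i = 1; i < count; i++) {
        if (list[i] < *min)
            *min = list[i];
        if (list[i] > *max)
            *max = list[i];
    }
}
```

It would be called as find_min_max_int(table, sizeof table / sizeof table[0], &lowest, &highest); note that sizeof(table) on its own is the size in bytes, not the element count.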

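It is a mistake in the code. The limits have to be assigned through the pointers before the loop, because a declaration in the for header would create new variables:
*min=255;
*max=0;
for( int i=0; i<count; i++){
if( *min>list[i])
*min=list[i];
if( *max<list[i] )
*max=list[i];
}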

Related

How can I make this loop run faster?

I'm using this code to find the highest temperature pixel in a thermal image and the coordinates of the pixel.
void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
int temp = 0;
for (int i = sz; i > 0; i--)
{
if (returnPixel->temperature < *image)
{
returnPixel->temperature = *image;
temp = i;
}
image++;
}
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}
With an image size of 640x480 it takes around 35ms to run through this function, which is too slow for what I need it for (under 10ms ideally).
This is executing on an ARM A9 processor running Linux.
The compiler I'm using is ARM v8 32-Bit Linux gcc compiler.
I'm using optimize -O3 and the following compile options: -march=armv7-a+neon -mcpu=cortex-a9 -mfpu=neon-fp16 -ftree-vectorize.
This is the output from the compiler:
000127f4 <_findMax>:
for(int i = sz; i > 0; i--)
127f4: e3510000 cmp r1, #0
{
127f8: e52de004 push {lr} ; (str lr, [sp, #-4]!)
for(int i = sz; i > 0; i--)
127fc: da000014 ble 12854 <_findMax+0x60>
12800: e1d2c0b0 ldrh ip, [r2]
12804: e2400002 sub r0, r0, #2
int temp = 0;
12808: e3a0e000 mov lr, #0
if(returnPixel->temperature < *image)
1280c: e1f030b2 ldrh r3, [r0, #2]!
12810: e153000c cmp r3, ip
returnPixel->temperature = *image;
12814: 81a0c003 movhi ip, r3
12818: 81a0e001 movhi lr, r1
1281c: 81c230b0 strhhi r3, [r2]
for(int i = sz; i > 0; i--)
12820: e2511001 subs r1, r1, #1
12824: 1afffff8 bne 1280c <_findMax+0x18>
12828: e30c3ccd movw r3, #52429 ; 0xcccd
1282c: e34c3ccc movt r3, #52428 ; 0xcccc
12830: e0831e93 umull r1, r3, r3, lr
12834: e1a034a3 lsr r3, r3, #9
12838: e0831103 add r1, r3, r3, lsl #2
1283c: e6ff3073 uxth r3, r3
12840: e04ee381 sub lr, lr, r1, lsl #7
12844: e6ffe07e uxth lr, lr
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
12848: e1c2e0b4 strh lr, [r2, #4]
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
1284c: e1c230b6 strh r3, [r2, #6]
}
12850: e49df004 pop {pc} ; (ldr pc, [sp], #4)
for(int i = sz; i > 0; i--)
12854: e3a03000 mov r3, #0
12858: e1a0e003 mov lr, r3
1285c: eafffff9 b 12848 <_findMax+0x54>
For clarity after comments:
Each pixel is an unsigned 16 bit integer, image[0] would be the pixel with coordinates 0,0, and the last in the array would have the coordinates 639,479.

This is executing on an ARM A9 processor running Linux.
ARM Cortex-A9 supports Neon.
With this in mind the goal should be to load 8 values (128 bits of pixel data) into a register, then do "compare with the current maximums for each of the 8 places" to get a mask, then use the mask and its inverse to mask out the "too small" old maximums and the "too small" new values; then OR the results to merge the new higher values into the "current maximums for each of the 8 places".
Once that has been done for all pixels (using a loop); you'd want to find the highest value in the "current maximums for each of the 8 places".
However; to find the location of the hottest pixel (rather than just how hot it is) you'd want to split the image into tiles (e.g. maybe 8 pixels wide and 8 pixels tall). This allows you to find the max. temperature within each tile (using Neon); then find the pixel within the hottest tile. Note that for huge images this lends itself to a "multi-layer" approach - e.g. create a smaller image containing the maximum from each tile in the original image; then do the same thing again to create an even smaller image containing the maximum from each "group of tiles", then ...
Making this work in plain C means trying to convince the compiler to auto-vectorize. The alternatives are to use compiler intrinsics or inline assembly. In any of these cases, using Neon to do 8 pixels in parallel (without any branches) could/should improve performance significantly (how much depends on RAM bandwidth).
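The lane-wise idea can be sketched in portable C before reaching for intrinsics. The hypothetical max_u16_lanes below keeps 8 running maximums (one per "lane") and reduces them at the end. Whether a given compiler turns the inner loop into Neon instructions is not guaranteed, so treat this as an illustration of the data flow rather than a promised speed-up. It only finds the value, not the location.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* assumption: 8 x 16-bit values per 128-bit register */

/* Hypothetical helper: returns the largest value in image[0..n-1],
   or 0 for an empty image. */
uint16_t max_u16_lanes(const uint16_t *image, size_t n)
{
    uint16_t lane_max[LANES] = { 0 };
    size_t i = 0;

    /* Main loop: lane l tracks the maximum of pixels l, l+8, l+16, ... */
    for (; i + LANES <= n; i += LANES) {
        for (size_t l = 0; l < LANES; l++) {
            uint16_t v = image[i + l];
            lane_max[l] = (v > lane_max[l]) ? v : lane_max[l];
        }
    }

    /* Horizontal reduction of the 8 lane maximums. */
    uint16_t best = 0;
    for (size_t l = 0; l < LANES; l++)
        best = (lane_max[l] > best) ? lane_max[l] : best;

    /* Tail: pixels left over when n is not a multiple of 8. */
    for (; i < n; i++)
        best = (image[i] > best) ? image[i] : best;

    return best;
}
```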

You should minimize memory access, especially in loops.
Every * or -> could cause unnecessary memory access resulting in serious performance hits.
Local variables are your best friend:
void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
int temp = 0;
uint16_t temperature = returnPixel->temperature;
uint16_t pixel;
for (int i = sz; i > 0; i--)
{
pixel = *image++;
if (temperature < pixel)
{
temperature = pixel;
temp = i;
}
}
returnPixel->temperature = temperature;
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}
Below is how this can be optimized by utilizing neon:
#include <stdint.h>
#include <arm_neon.h>
#include <assert.h>
static inline void findMax128_neon(uint16_t *pDst, uint16x8_t *pImage)
{
const uint16_t *p = (const uint16_t *) pImage; // vld1q_u16 takes a pointer to uint16_t
uint16x8_t in0, in1, in2, in3, in4, in5, in6, in7, in8, in9, in10, in11, in12, in13, in14, in15;
uint16x4_t dmax;
in0 = vld1q_u16(p); // 16 loads of 8 pixels each = 128 pixels
in1 = vld1q_u16(p + 8);
in2 = vld1q_u16(p + 16);
in3 = vld1q_u16(p + 24);
in4 = vld1q_u16(p + 32);
in5 = vld1q_u16(p + 40);
in6 = vld1q_u16(p + 48);
in7 = vld1q_u16(p + 56);
in8 = vld1q_u16(p + 64);
in9 = vld1q_u16(p + 72);
in10 = vld1q_u16(p + 80);
in11 = vld1q_u16(p + 88);
in12 = vld1q_u16(p + 96);
in13 = vld1q_u16(p + 104);
in14 = vld1q_u16(p + 112);
in15 = vld1q_u16(p + 120);
in0 = vmaxq_u16(in1, in0);
in2 = vmaxq_u16(in3, in2);
in4 = vmaxq_u16(in5, in4);
in6 = vmaxq_u16(in7, in6);
in8 = vmaxq_u16(in9, in8);
in10 = vmaxq_u16(in11, in10);
in12 = vmaxq_u16(in13, in12);
in14 = vmaxq_u16(in15, in14);
in0 = vmaxq_u16(in2, in0);
in4 = vmaxq_u16(in6, in4);
in8 = vmaxq_u16(in10, in8);
in12 = vmaxq_u16(in14, in12);
in0 = vmaxq_u16(in4, in0);
in8 = vmaxq_u16(in12, in8);
in0 = vmaxq_u16(in8, in0);
dmax = vmax_u16(vget_high_u16(in0), vget_low_u16(in0));
dmax = vpmax_u16(dmax, dmax);
dmax = vpmax_u16(dmax, dmax);
vst1_lane_u16(pDst, dmax, 0);
}
void _findMax_neon(uint16_t *image, int sz, sPixelData *returnPixel)
{
assert((sz % 128) == 0);
const uint32_t nSector = sz/128;
uint16_t max[nSector];
uint32_t i, s, nMax;
uint16_t *pImage;
for (i = 0; i < nSector; ++i)
{
findMax128_neon(&max[i], (uint16x8_t *) &image[i*128]);
}
s = 0;
nMax = max[0];
for (i = 1; i < nSector; ++i)
{
if (max[i] > nMax)
{
s = i;
nMax = max[i];
}
}
if (nMax < returnPixel->temperature)
{
returnPixel->x_location = 0;
returnPixel->y_location = 0;
return;
}
pImage = &image[s * 128]; // start of the sector that holds the maximum
i = 0;
while(1) {
if (*pImage++ == nMax) break;
i += 1;
}
i += 128 * s;
returnPixel->temperature = nMax;
returnPixel->x_location = i % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = i / IMAGE_HORIZONTAL_SIZE;
}
Beware that the function above assumes sz being a multiple of 128.
And yes, it will run in less than 10ms.

The culprit here is the slow linear search for the highest "temperature". I'm not quite sure how to improve that search algorithm with the information given, if at all possible (could you sort the data in advance?), but you could start with this:
uint16_t max = 0;
size_t found_index = 0;
for(size_t i=0; i<sz; i++)
{
if(max < image[i])
{
max = image[i];
found_index = i; // image[0] is pixel (0,0), so the array index is the position
}
}
returnPixel->temperature = max;
returnPixel->x_location = found_index % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = found_index / IMAGE_HORIZONTAL_SIZE;
This might give a very slight performance gain because of the top to bottom iteration order and not touching unrelated memory returnPixel in the middle of the loop. max should get stored in a register and with luck you might get slightly better cache performance overall. Still, this comes with a branch like the original code, so it is a minor improvement.
Another micro-optimization is to change the parameter to const uint16_t* image - this might give slightly better pointer aliasing, in case returnPixel happens to contain a uint16_t too. image should be const regardless of performance, since const correctness is always good practice.
Further, more obscure optimization tricks might be possible if you read image 32 or 64 bits at a time, then come up with a fast look-up method to find the largest pixel value inside that 32/64 bit chunk.

If you have to find the hottest pixel in the image and there is no structure to the image data itself, then I think you're stuck with iterating through the pixels. If so, you have a number of different ways to make this faster:
As suggested above, try loop unrolling and other micro-optimisation tricks; this might give you the performance boost you need.
Go parallel: split the array into N chunks, find the MAX[N] for each chunk, and then find the largest of the MAX[N] values. You have to be careful here, as setting up the parallel processes can take longer than doing the work.
If there is some structure to the image, say lots of cold pixels and a hot spot (larger than 1 pixel) that you're trying to find, then there are other techniques you could use.
One approach could be to split the image up into N boxes and then sample each box. The hottest pixel in the hottest box (and maybe in the boxes adjacent to it) would then be your result. However, this depends on there being some structure to the image which you can rely on.

The assembly language reveals the compiler is storing in returnPixel->temperature each time a new maximum is found, with the instruction strhhi r3, [r2]. Eliminate this unnecessary store by caching the maximum in a local object and only updating returnPixel->temperature after the loop ends:
uint16_t maximum = returnPixel->temperature;
for (int i = sz; i > 0; i--)
{
if (maximum < *image)
{
maximum = *image;
temp = i;
}
image++;
}
returnPixel->temperature = maximum;
That is unlikely to reduce the execution time as much as you need, but it might if there is some bad cache or memory interaction occurring. It is a very simple change, so try it before moving on to the SIMD vectorizations suggested in other answers.
Regarding vectorization, two approaches are:
Iterate through the image using a vmax instruction to update the maximum value seen so far in each SIMD lane. Then consolidate the lanes to find the overall maximum. Then iterate through the image again looking for that maximum in any lane. (I forget what that architecture has for instructions that would assist in testing whether a comparison produced true in any lane.)
Iterate through the image maintaining three registers: One with the maximum seen so far in each lane, one with a counter of position in the image, and one with, in each lane, a record of the counter value at the time each new maximum was seen. The first can be updated with vmax, as above. The second can be updated with vadd. The third can be updated with a lane-wise compare (such as vcgt) and vbit. After the loop, figure out which lane has the maximum, and get the position of the maximum from the recorded counter for that lane.
Depending on the performance of the necessary instructions, a hybrid approach may be faster:
Set some strip size S. Partition the image into strips of that size. For each strip in the image, find the maximum (using the fast vmax loop described above). If the maximum is greater than seen in previous strips, remember it and the current strip number. After processing the whole image, the task has been reduced to finding the location of the maximum in a particular strip. Use the second loop from the first approach above for that. (For large images, further refinements may be possible, possibly refining the location using a shorter strip size before finding the exact location, depending on cache behavior and other factors.)
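The strip idea can be sketched in plain C as follows. This is an illustration under assumptions, not a tuned implementation: strip_max and find_max_index_strips are hypothetical names, the strip size is left to the caller, and strip_max is an ordinary scalar loop standing in for the fast vmax loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: plain maximum of one strip. This is the loop a
   vectorizer (or a hand-written vmax loop) would replace. */
static uint16_t strip_max(const uint16_t *p, size_t len)
{
    uint16_t m = 0;
    for (size_t i = 0; i < len; i++)
        m = (p[i] > m) ? p[i] : m;
    return m;
}

/* Returns the index of the first occurrence of the maximum in
   image[0..n-1] and stores the maximum in *max_out. */
size_t find_max_index_strips(const uint16_t *image, size_t n, size_t strip,
                             uint16_t *max_out)
{
    assert(n > 0 && strip > 0);

    uint16_t best = 0;
    size_t best_start = 0;

    /* Phase 1: value-only search, one strip at a time. */
    for (size_t start = 0; start < n; start += strip) {
        size_t len = (n - start < strip) ? n - start : strip;
        uint16_t m = strip_max(image + start, len);
        if (start == 0 || m > best) {
            best = m;
            best_start = start;
        }
    }

    /* Phase 2: locate the maximum inside the winning strip only.
       The loop terminates because best was found in that strip. */
    size_t i = best_start;
    while (image[i] != best)
        i++;

    *max_out = best;
    return i;
}
```

The returned index can then be split into x and y with % and / by the image width, as in the original function.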

In my opinion you cannot improve that algorithm: any array element can hold the maximum, so you need at least one pass through the data, and I don't believe you can improve on this without multithreading. You can start several threads (as many as you have cores/processors) and give each a subset of your image. Once they are finished, you will have as many local maximum values as the number of threads you started. Just do a second pass over those values to get the overall maximum, and you are finished. But consider the extra workload of creating threads, allocating stack memory for them, and scheduling; when the number of values is low, that can be higher than the workload of running everything in a single thread. If you have a thread pool that provides ready-to-run threads, and you can use it, then you'll probably be able to finish in about 1/N of the time it takes to run the whole loop on a single processor (where N is the number of cores on the machine).
Note: using a dimension that is a power of two will save you the job of calculating a quotient and a remainder, since the division can be done with a bit shift and a bit mask. You use it only once in your function, but it's an improvement anyway.
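To illustrate the shift-and-mask remark, here is a small sketch that assumes each row is padded to a stride of 1024 pixels. That stride is an assumption for the example only (the image in the question is 640 wide), and the helper names are hypothetical.

```c
#include <stdint.h>

#define STRIDE_SHIFT 10u                    /* assumed: rows padded to 1024 pixels */
#define STRIDE       (1u << STRIDE_SHIFT)
#define STRIDE_MASK  (STRIDE - 1u)

/* Hypothetical helpers: split a linear pixel index into column and row. */
static inline uint32_t index_to_x(uint32_t index)
{
    return index & STRIDE_MASK;             /* same result as index % STRIDE */
}

static inline uint32_t index_to_y(uint32_t index)
{
    return index >> STRIDE_SHIFT;           /* same result as index / STRIDE */
}
```

With unsigned operands and a constant power-of-two divisor, compilers usually perform this transformation themselves, so writing index % STRIDE and index / STRIDE is normally just as fast and easier to read.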

Why does storing a %x value to a variable do funky things when leading value is 8-f?

I am writing a program in c which imitates an LC-3 simulator. One of the objectives of this program is to store a 4 digit hexadecimal value from a file (0000 - ffff), and to convert it to binary, and interpret an LC-3 instruction from it. The following code segment shows how I am storing this value into a variable (which is where the problem seems to lie), and below that is the output I am receiving:
int *strstr(int s, char c);
void initialize_memory(int argc, char *argv[], CPU *cpu) {
FILE *datafile = get_datafile(argc, argv);
// Buffer to read next line of text into
#define DATA_BUFFER_LEN 256
char buffer[DATA_BUFFER_LEN];
int counter = 0;
// Will read the next line (words_read = 1 if it started
// with a memory value). Will set memory location loc to
// value_read
//
int value_read, words_read, loc = 0, done = 0;
char comment;
char *read_success; // NULL if reading in a line fails.
int commentLine =0;
read_success = fgets(buffer, DATA_BUFFER_LEN, datafile);
while (read_success != NULL && !done) {
// If the line of input begins with an integer, treat
// it as the memory value to read in. Ignore junk
// after the number and ignore blank lines and lines
// that don't begin with a number.
//
words_read = sscanf(buffer, "%04x%c", &value_read, &comment);
// if an integer was actually read in, then
// set memory value at current location to
// value_read and increment location. Exceptions: If
// loc is out of range, complain and quit the loop. If
// value_read is outside 0000 and ffff, then it's a
// sentinel -- we should say so and quit the loop.
if (value_read == NULL || comment ==';')
{
commentLine = 1;
}
if (value_read < -65536 || value_read > 65536)
{
printf("Sentinel read in place of Memory location %d: quitting loop\n", loc);
break;
}
else if (value_read >= -65536 && value_read <= 65536)
{
if (commentLine == 0)
{
if (counter == 0)
{
loc = value_read;
cpu -> memLocation = loc;
printf("\nPC location set to: x%04x \n\n", cpu -> memLocation);
counter++;
}
else
{
cpu -> mem[loc] = value_read;
printf("x%04x : x%d\t %04x \t ", loc,loc, cpu -> mem[loc]);
print_instr(cpu, cpu -> mem[loc]);
loc++;
value_read = NULL;
}
}
}
if (loc > 65536)
{
printf("Reached Memory limit, quitting loop.\n", loc);
break;
}
commentLine = 0;
read_success = fgets(buffer, DATA_BUFFER_LEN, datafile);
// Gets next line and continues the loop
}
fclose(datafile);
// Initialize rest of memory
while (loc < MEMLEN) {
cpu -> mem[loc++] = 0;
}
}
My aim is to show the Hex address : decimal address, the hex instruction, binary code, and then at the end, its LC-3 instruction translation. The data I am scanning from the file is the hex instruction:
x1000 : x4096 200c 0010000000001100 LD, R0, 12
x1001 : x4097 1221 0001001000100000 ADD, R1, R0, 0
x1002 : x4098 1401 0001010000000000 ADD, R2, R0, R0
x1003 : x4099 ffff94bf 0000000000000000 NOP
x1004 : x4100 166f 0001011001101110 ADD, R3, R1, 14
x1005 : x4101 1830 0001100000110000 ADD, R4, R0, -16
x1006 : x4102 1b04 0001101100000100 ADD, R5, R4, R4
x1007 : x4103 5d05 0101110100000100 AND, R6, R4, R4
x1008 : x4104 5e3f 0101111000111110 AND, R7, R0, -2
x1009 : x4105 5030 0101000000110000 AND, R0, R0, -16
x100a : x4106 52ef 0101001011101110 AND, R1, R3, 14
x100b : x4107 5fe0 0101111111100000 AND, R7, R7, 0
x100c : x4108 fffff025 0000000000000000 NOP
x100d : x4109 7fff 0111111111111110 STR, R7, R7, -2
As you can see, my problem lies in addresses x1003 and x100c;
As stated in the headline, when storing the hex instruction, if the leading hex digit is between 8 and f, my best guess is that the scan is interpreting it as a negative value because the most significant bit of that digit is set. If that is the case, it makes perfect sense, but is there a way I can bypass this? And if it isn't the case, what else could be causing this?
I found that if I pass value_read into print_instr() instead of cpu -> mem[loc], then the output works correctly. However, this is only a temporary fix as I need to store that value for later use in the program(for actual execution of the instruction). So the problem seems to arise while storing, and I am unsure as to why.
Additionally, (and this is a side question) though it is not a pressing concern, since I am using %x%c (value_read, comment) to store values from the file, I have been having trouble with the first few lines of the .hex file I am using, in which there is no hex value in the line, but instead just a comment symbol (for those unfamiliar with lc_3 simulators, the ';' is the symbol for comments). Whenever this is the case, I get a hex value of zero, although I wish for it to be NULL(In my program, I implemented a temporary solution because I am not sure how to fix it). I am not an expert in c just yet, and have not been able to find a solution to this problem. If you can help, it would be greatly appreciated, otherwise, it isn't a big issue for what I am trying to achieve with this program, it is more so just for my own knowledge and growth.
Thank you all in advance for your help :)

In a scanf family format string, the %x specifier means to read into an unsigned int. The corresponding argument must have exactly the type unsigned int *.
However you supply an argument of type int *.
This causes undefined behaviour. What you are seeing is the chance interaction between library elements that expect you to follow the rules, and your code that didn't follow the rules.
To fix it, follow the rules. For example, read into an unsigned int variable.
NB. 0 does nothing in the scanf format string; %04x is equivalent to %4x.
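A minimal sketch of reading one line under these rules: the hypothetical parse_hex_word below reads into an unsigned int and uses the return value of sscanf to tell data lines from comment or blank lines, instead of comparing the value against NULL. Storing the result in a uint16_t is an assumption about how the memory words are meant to be kept.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 and stores the value if the line starts
   with a hex number of at most 4 digits, 0 otherwise (e.g. a ';' comment). */
int parse_hex_word(const char *line, uint16_t *out)
{
    unsigned int value;                 /* %x requires an unsigned int * */

    if (sscanf(line, "%4x", &value) != 1)
        return 0;                       /* nothing converted: not a data line */

    *out = (uint16_t)value;             /* at most 4 hex digits, fits in 16 bits */
    return 1;
}
```

Because sscanf reports how many conversions succeeded, a line that holds only a comment yields 0 (or EOF for an empty string) and can be skipped without inventing a sentinel value.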

May I suppose that cpu->mem is an array of short or similar? Then sign extension occurs when printing cpu->mem[loc]. Remember that integer arguments narrower than int are converted to int in printf calls. The symptom is the same as in the following code:
int i;
scanf("%4x",&i);
printf("%x\n",i);
short s = i;
printf("--> %x\n",s);
With an input such as ffff, the short ends up as -1; when it is converted to int it stays -1, which is 0xffffffff (with a 32-bit int).
Use unsigned short instead.
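The difference can be shown with two tiny helpers (hypothetical names). The sketch assumes a 32-bit int and the usual two's complement conversion, which is what typical desktop and ARM compilers provide.

```c
#include <stdint.h>

/* Store a raw 16-bit word in a signed type, then widen it again. */
unsigned int widen_signed(unsigned int raw)
{
    int16_t s = (int16_t)raw;      /* 0x94bf becomes a negative number */
    return (unsigned int)s;        /* sign extension fills the upper bits */
}

/* Store the same word in an unsigned type, then widen it again. */
unsigned int widen_unsigned(unsigned int raw)
{
    uint16_t u = (uint16_t)raw;    /* value is kept as 0x94bf */
    return (unsigned int)u;        /* zero extension, upper bits stay 0 */
}
```

Under those assumptions, widen_signed(0x94bf) gives 0xffff94bf, matching the output in the question, while widen_unsigned(0x94bf) gives 0x94bf. Words whose top bit is clear, such as 0x200c, come out the same either way.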

Writing a function in ARM assembly language which inserts a string into another string at a specific location

I was going through one of my class's textbooks and I stumbled upon this problem:
Write a function in ARM assembly language which will insert a string into another string at a specific location. The function is:
char * csinsert( char * s1, char * s2, int loc ) ;
The function has a pointer to s1 in a1, a pointer to s2 in a2, and an integer in a3 as to where the insertion takes place. The function returns a pointer in a1 to the new string.
You can use the library functions strlen and malloc.
strlen has as input the pointer to the string in a1 and returns the length in a1.
malloc will allocate space for the new string where a1 on input is the size in bytes of the space requested and on output a1 is a pointer to the requested space.
Remember the registers a1-a4 do not retain their values across function calls.
This is the C language driver for the string insert I created:
#include <stdio.h>
extern char * csinsert( char * s1, char * s2, int loc ) ;
int main( int argc, char * argv[] )
{
char * s1 = "String 1 are combined" ;
char * s2 = " and string 2 " ;
int loc = 8 ;
char * result ;
result = csinsert( s1, s2, loc ) ;
printf( "Result: %s\n", result ) ;
}
My assembly language code so far is:
.global csinsert
csinsert:
stmfd sp!, {v1-v6, lr}
mov v1, a1
bl strlen
add a1, a1, #1
mov v2, a1
add a2, a2
mov v3, a2
add a3, a3
bl malloc
mov v3, #0
loop:
ldrb v4, [v1], #1
subs v2, v2, #1
add v4, v4, a2
strb v4, [a1], #1
bne loop
ldmfd sp!, {v1-v6, pc} #std
.end
I don't think my code works properly. When I link the two files, no result is given back. Why does my code not insert the string properly? I believe the issue is in the assembly program; is it not returning anything?
Can anyone explain what my mistake is? I'm not sure how to use the library functions the question hints to.
Thanks!

Caveat: I've been doing asm for 40+ years. I've looked at arm a bit, but not used it. However, I pulled up the arm ABI document.
As the problem stated, a1-a4 are not preserved across a call, which matches the ABI. You saved your a1, but you did not save your a2 or a3.
strlen [or any other function] is permitted to use a1-a4 as scratch regs. So, for efficiency, my guess is that strlen [or malloc] is using a2-a4 as scratch and [from your perspective] corrupting some of the register values.
By the time you get to loop:, a2 is probably a bogus journey :-)
UPDATE
I started to clean up your asm. Style is 10x more important in asm than C. Every asm line should have a sidebar comment. And add a blank line here or there. Because you didn't post your updated code, I had to guess at the changes and after a bit, I realized you only had about 25% or so. Plus, I started to mess things up.
I split the problem into three parts:
- Code in C
- Take C code and generate arm pseudo code in C
- Code in asm
If you take a look at the C code and pseudo code, you'll notice that any misuse of instructions aside, your logic was wrong (e.g. you needed two strlen calls before the malloc)
So, here is your assembler cleaned for style [not much new code]. Notice that I may have broken some of your existing logic, but my version may be easier on the eyes. I used tabs to separate things and got everything to line up. That can help. Also, the comments show intent or note limitations of instructions or architecture.
.global csinsert
csinsert:
stmfd sp!,{v1-v6,lr} // preserve caller registers
// preserve our arguments across calls
mov v1,a1
mov v2,a2
mov v3,a3
// get length of destination string
mov a1,v1 // set dest addr as strlen arg
bl strlen // call strlen
add a1,a1,#1 // increment length
mov v4,a1 // save it
add v2,v2 // src = src + src (what???)
mov v5,v2 // save it
add v3,v3 // double the offset (what???)
bl malloc // get heap memory
mov v4,#0 // set index for loop
loop:
ldrb v7,[v1],#1
subs v2,v2,#1
add v7,v7,a2
strb v7,[a1],#1
bne loop
ldmfd sp!,{v1-v6,pc} #std // restore caller registers
.end
At first, you should prototype in real C:
#include <stdlib.h> // malloc
#include <string.h> // strlen
// csinsert_real -- real C code
char *
csinsert_real(char *s1,char *s2,int loc)
{
int s1len;
int s2len;
char *bp;
int chr;
char *bf;
s1len = strlen(s1);
s2len = strlen(s2);
bf = malloc(s1len + s2len + 1);
bp = bf;
// copy over s1 up to but not including the "insertion" point
for (; loc > 0; --loc, ++s1, ++bp) {
chr = *s1;
if (chr == 0)
break;
*bp = chr;
}
// "insert" the s2 string
for (chr = *s2++; chr != 0; chr = *s2++, ++bp)
*bp = chr;
// copy the remainder of s1 [if any]
for (chr = *s1++; chr != 0; chr = *s1++, ++bp)
*bp = chr;
*bp = 0;
return bf;
}
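For comparison, the same three copy steps can be written more compactly with memcpy. This is only a sketch: the name csinsert_memcpy is hypothetical, clamping loc to the length of s1 mirrors the early break in the first loop above, and returning NULL when malloc fails is an added assumption that the original does not make.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant: returns a newly allocated string with s2 inserted
   into s1 at offset loc, or NULL if the allocation fails. */
char *csinsert_memcpy(const char *s1, const char *s2, int loc)
{
    size_t s1len = strlen(s1);
    size_t s2len = strlen(s2);

    /* Clamp the insertion point to the range 0..s1len. */
    size_t at = (loc < 0) ? 0 : (size_t)loc;
    if (at > s1len)
        at = s1len;

    char *bf = malloc(s1len + s2len + 1);
    if (bf == NULL)
        return NULL;

    memcpy(bf, s1, at);                               /* head of s1 */
    memcpy(bf + at, s2, s2len);                       /* inserted string */
    memcpy(bf + at + s2len, s1 + at, s1len - at + 1); /* tail of s1 plus '\0' */
    return bf;
}
```

The caller owns the returned buffer and should free it. The explicit loops in csinsert_real remain the better starting point for a hand translation into assembly, since each loop maps onto a short load/store sequence.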
Then, you can [until you're comfortable with arm], prototype in C "pseudocode":
// csinsert_pseudo -- pseudo arm code
char *
csinsert_pseudo()
{
// save caller registers
v1 = a1;
v2 = a2;
v3 = a3;
a1 = v1;
strlen();
v4 = a1;
a1 = v2;
strlen();
a1 = a1 + v4 + 1;
malloc();
v5 = a1;
// NOTE: load/store may only use r0-r7
// and a1 is r0
#if 0
r0 = a1;
#endif
r1 = v1;
r2 = v2;
// copy over s1 up to but not including the "insertion" point
loop1:
if (v3 == 0) goto eloop1;
r3 = *r1;
if (r3 == 0) goto eloop1;
*r0 = r3;
++r0;
++r1;
--v3;
goto loop1;
eloop1:
// "insert" the s2 string
loop2:
r3 = *r2;
if (r3 == 0) goto eloop2;
*r0 = r3;
++r0;
++r2;
goto loop2;
eloop2:
// copy the remainder of s1 [if any]
loop3:
r3 = *r1;
if (r3 == 0) goto eloop3;
*r0 = r3;
++r0;
++r1;
goto loop3;
eloop3:
*r0 = 0;
a1 = v5;
// restore caller registers
}

How does function ACTUALLY return struct variable in C?

How a function returns a value is clear to me; just to kick things off:
int f()
{
int a = 2;
return a;
}
Here a lives on the stack and its lifetime ends with f(). To return the value, the function copies it into a special register, which the caller reads because it knows the callee has placed the value there.
(Since the size of that return-value register is limited, we can't return large objects that way. In higher-level languages, when we want to return an object, the function actually copies the address of the object on the heap into that register.)
Let's come back to C for a situation where I want to return a struct variable, not a pointer:
struct inventory
{
char name[20];
int number;
};
struct inventory function();
int main()
{
struct inventory items;
items=function();
printf("\nam in main\n");
printf("\n%s\t",items.name);
printf(" %d\t",items.number);
getch();
return 0;
}
struct inventory function()
{
struct inventory items;
printf(" enter the item name\n ");
scanf("%19s", items.name);
printf(" enter the number of items\n ");
scanf("%d",&items.number );
return items;
}
Code forked from: https://stackoverflow.com/a/22952975/962545
Here is the deal,
Let's start with main: the items variable is declared but not initialized, then function is called, which returns an initialized structure variable that gets copied into the one in main.
Now I am a bit unclear about how function() returned the struct variable items, which is not dynamically created (technically not on the heap), so its lifetime is limited to the body of function(). Also, the variable can be too large to fit in a special register, so why did it work? (I know we can dynamically allocate the struct inside the function and return its address, but I don't want an alternative; I am looking for an explanation.)
Question:
It works, but how does function() actually return the struct variable so that it gets copied into the items variable in main, when it is supposed to die when function() returns?
I am surely missing something important; a detailed explanation would help. :)
EDIT:
Other Answer References:
https://stackoverflow.com/a/2155742/962545
Named as return value optimization

Details vary widely by calling convention. Some ABIs have no calling convention for passing whole structures, in which case the compiler is free to do whatever it thinks makes sense.
Examples include:
Passing and returning the entire struct as a series of consecutive registers (often used with "small" structs)
Placing the entire struct as an argument block on the stack
Allocating an empty argument big enough to hold the struct, to be filled with a return value
Passing the (stack) address of the struct as an argument (as if the function was declared void function(struct inventory *))
Any of these implementations could conform to the C spec here. But, let's look at a specific implementation: the output from my GCC ARM cross-compiler.
Compiling the code you gave gives me this:
main:
stmfd sp!, {fp, lr}
add fp, sp, #4
sub sp, sp, #48
sub r3, fp, #52
mov r0, r3
bl function(PLT)
Destination operands are always on the left. You can see that the program reserves stack space, then passes the address of the stack space as r0 (the first argument in the ARM EABI calling convention). function takes no arguments, so this argument is clearly an artificial argument added by our compiler.
function looks like this:
function:
stmfd sp!, {r4, fp, lr}
add fp, sp, #8
sub sp, sp, #36
str r0, [fp, #-40]
ldr r3, .L6
...
add r2, pc, r2
mov r0, r2
mov r1, r3
bl scanf(PLT)
ldr r3, [fp, #-40]
mov ip, r3
sub r4, fp, #36
ldmia r4!, {r0, r1, r2, r3}
stmia ip!, {r0, r1, r2, r3}
ldmia r4, {r0, r1}
stmia ip, {r0, r1}
ldr r0, [fp, #-40]
sub sp, fp, #8
ldmfd sp!, {r4, fp, pc}
This code basically stashes the single argument in [fp, #-40], then later loads it and begins stashing data at the address it points to. At the end, it returns this pointer value in r0 again. Effectively, the compiler has made the function signature into
struct inventory *function(struct inventory *)
where the returned structure is allocated on the stack by the caller, passed in, and then returned.
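The same convention can be written by hand in C. This is a minimal sketch, assuming the struct inventory from the question; the name fill_inventory and the sample values are hypothetical and only illustrate the shape of the transformation, not the exact code the compiler emits.

```c
#include <string.h>

struct inventory {
    char name[20];
    int  number;
};

/* Hand-written equivalent of the hidden-pointer convention: the caller
 * owns the storage and passes its address, the callee fills it in and
 * returns the same address. fill_inventory is a hypothetical name. */
struct inventory *fill_inventory(struct inventory *out)
{
    struct inventory local = {0};      /* exists only during this call */

    strcpy(local.name, "widget");      /* sample data, fits in name[20] */
    local.number = 42;

    *out = local;                      /* copy into the caller's storage */
    return out;                        /* hand the caller's pointer back */
}
```

A caller would declare struct inventory items; and call fill_inventory(&items). Nothing that lives in the callee's stack frame is referenced after the return, which is why the original code is safe.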

You're missing the most obvious thing there is to C's way of passing/returning things: everything is passed around by value, or at least: it behaves that way.
That is to say:
struct foo
{
    int member;
    double bar;
};
struct foo some_f( void )
{
    struct foo local = {
        .member = 123,
        .bar = 2.0
    };
    // some awesome code
    return local;
}
Will work, just fine. If the struct is small, then it's possible that this code will create a local struct variable, and return a copy of that struct to the caller.
In other cases, however, this code will roughly translate to :
void caller()
{
struct foo hidden_stack_space;
struct foo your_var = *(some_f(&hidden_stack_space));
}
//and the some_f function will behave as:
struct foo * some_f(struct foo * local)
{
//works on local and
return local;
}
Well, this isn't exactly what happens all of the time, but it boils down to this, more or less. The observable result is the same either way; only the mechanism differs between compilers.
Bottom line is: C returns by value, so your code works fine.
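A small sketch of what "by value" means in practice. The type struct point and the helpers make_point and shifted are hypothetical names chosen for illustration; the point is only that both arguments and return values are copies.

```c
struct point {
    int x;
    int y;
};

/* Returns the struct by value: the caller receives a copy of `local`,
 * not a reference to it. */
struct point make_point(int x, int y)
{
    struct point local = { .x = x, .y = y };
    return local;
}

/* The parameter is also a copy, so changing p.x here is visible only
 * through the returned value, never through the caller's variable. */
struct point shifted(struct point p, int dx)
{
    p.x += dx;
    return p;
}
```

After struct point a = make_point(1, 2); and struct point b = shifted(a, 10);, a is unchanged and b.x holds 11.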
However, there are pitfalls:
struct foo
{
int member1;
char *str;
};
struct foo some_f()
{
char bar[] = "foobar";
struct foo local = {
.member1 = 123,
.str = &bar[0]
};
return local;
}
This is dangerous: the pointer assigned to local.str points to bar, a local array whose lifetime ends when some_f returns. In that case, the problems you expected are real: that memory is no longer valid after the return.
The reason is that a pointer is just a variable whose value is a memory address, and only that address is returned/assigned, not the data it points to.
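One way around this pitfall, sketched under the assumption that the string has a small, known maximum length, is to store the characters inside the struct so they are copied along with it. The names foo_safe and some_f_safe and the size 16 are hypothetical.

```c
#include <string.h>

struct foo_safe {
    int  member1;
    char str[16];   /* the characters live inside the struct itself */
};

struct foo_safe some_f_safe(void)
{
    struct foo_safe local = { .member1 = 123 };  /* str is zero-filled */

    strcpy(local.str, "foobar");  /* 7 bytes including '\0', fits in 16 */
    return local;                 /* array contents are copied with the struct */
}
```

Pointing str at a string literal or at malloc'ed memory would also work, because in both cases the storage outlives the function; with malloc the caller becomes responsible for freeing it.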

A struct, at least a large one, will be allocated and returned on the stack, and will be popped off the stack (if at all) by the caller. The compiler will try to allocate it in the same spot where the caller is expecting to find this, but it will make a copy if that is not possible. It is possible, but not necessary that there is also a pointer to the struct, returned via registers.
Of course the details will vary depending on architecture.

How to define a stack for each task on an ARM

Today I have a little problem, and I think its origin is the stack.
This is my problem:
I declare three user tasks like this:
void task1Function(void) {
print_uart0("-usertask : First task is started...\r\n");
while(1){
//syscall(1);
}
}
void task2Function(void) {
print_uart0("-usertask : Second task is running...\r\n");
//syscall(); /* To return in the kernel's mode */
while(1){
}
}
void task3Function(void) {
print_uart0("-usertask : Third task is running...\r\n");
//syscall(); /* To return in the kernel's mode */
while(1){
}
}
I have an array of three tasks whose structure is below:
typedef struct taskstruct /* tag needed so qnext/qprev refer to this type */
{
    unsigned int *sp;
    unsigned int registers[13]; /* r0-r12: init_task writes indices 0..12 */
    unsigned int lr;
    unsigned int pc;
    unsigned int cpsr;
    unsigned int mode;
    unsigned int num;
    void *stack;
    int stacksize;
    int priority;
    int state; /* Running, Ready, Waiting, Suspended */
    /* Next and previous task in the queue */
    struct taskstruct *qnext, *qprev;
}taskstruct;
Here is the initialisation of my tasks:
void init_task(taskstruct * task, void (*function)(void) ){
task->sp = (unsigned int*)&function;
task->registers[0] = 0; // r0
task->registers[1] = 0; // r1
task->registers[2] = 0; // r2
task->registers[3] = 0; // r3
task->registers[4] = 0; // r4
task->registers[5] = 0; // r5
task->registers[6] = 0; // r6
task->registers[7] = 0; // r7
task->registers[8] = 0; // r8
task->registers[9] = 0; // r9
task->registers[10] = 0; // r10
task->registers[11] = 0; // r11
task->registers[12] = 0; // r12
task->lr = 0;
task->pc = 0;
task->cpsr = 0;
task->mode = 0x10;
}
init_task(&task[0],&task1Function);
init_task(&task[1],&task2Function);
init_task(&task[2],&task3Function);
But when I pass task[0].sp to my activate function, it is always the last declared task that is launched (i.e. the third):
.global activate
activate:
LDR r12, [r0]
/*STMFD sp!,{r1-r11,lr}*/
NOP
msr CPSR_c, #0x10 /* User mode with IRQ enabled and FIQ disabled*/
mov pc, r12
So I guess that I have a problem with my user stack and that I have to set up a different stack for each task. Am I right? If so, can somebody tell me how to proceed?
Regards, Vincent

Yes, you are right.
You must have separate stacks for each task, since the state of the stack is part of the task's total state.
You could add something like
uint32_t stack[512];
to your taskstruct, and then save/restore the stack pointer accordingly when switching tasks, of course. It's hard to provide enough detail in this context.
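As a rough sketch of that suggestion: the structure below is a stripped-down, hypothetical stand-in for the question's taskstruct (task_t, task_init, and TASK_STACK_WORDS are made-up names), and it only shows how each task gets its own stack and an initial stack pointer. It assumes the usual ARM full-descending stack with 8-byte alignment; the actual context switch still has to be done in assembly.

```c
#include <stdint.h>

#define TASK_STACK_WORDS 512u   /* assumed size, taken from the answer above */

typedef struct task {
    uint32_t *sp;                        /* saved stack pointer of this task */
    void    (*entry)(void);              /* function the task starts in */
    uint32_t  stack[TASK_STACK_WORDS];   /* private stack, one per task */
} task_t;

/* Give the task an empty stack. On ARM the stack grows downwards, so the
 * initial sp is the highest address of the array, rounded down to the
 * 8-byte alignment the procedure call standard expects. */
void task_init(task_t *t, void (*entry)(void))
{
    uintptr_t top = (uintptr_t)(t->stack + TASK_STACK_WORDS);

    t->entry = entry;
    t->sp = (uint32_t *)(top & ~(uintptr_t)7);
}
```

With this layout every task's sp points into its own array, so pushing registers for one task can never overwrite the saved state of another.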

'OK, but how can I allocate the task stack?' - malloc it.
You can economize on mallocs by adding the stack space on the end of the taskstruct - all your current members are fixed-size, so calculating the total size to malloc is easy.
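A minimal sketch of the single-malloc idea, using a C99 flexible array member so the stack sits directly behind the control block. The names dyn_task_t, task_create, and task_destroy are hypothetical, and the struct holds only the fields needed for the illustration rather than the full taskstruct from the question.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct dyn_task {
    uint32_t *sp;           /* saved stack pointer */
    size_t    stack_words;  /* number of 32-bit words in stack[] */
    uint32_t  stack[];      /* flexible array member: the stack itself */
} dyn_task_t;

/* One malloc covers the control block and its stack.
 * Returns NULL if the size would overflow or the allocation fails. */
dyn_task_t *task_create(size_t stack_words)
{
    dyn_task_t *t;

    if (stack_words > (SIZE_MAX - sizeof *t) / sizeof(uint32_t)) {
        return NULL;
    }
    t = malloc(sizeof *t + stack_words * sizeof(uint32_t));
    if (t == NULL) {
        return NULL;
    }
    t->stack_words = stack_words;
    t->sp = t->stack + stack_words;   /* descending stack starts at the top */
    return t;
}

void task_destroy(dyn_task_t *t)
{
    free(t);
}
```

Because the stack size is a run-time argument, tasks with different stack requirements can share the same type, and a single free releases both the control block and the stack.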
The hard bit with these taskers is getting the interrupt entry correct, i.e. when an interrupt handler needs to change a task state and so has to exit via the scheduler. This is especially good fun with nested interrupts. Good luck with that!
