BPF verifier says program exceeds 1M instruction - c

For the following program I get an error from the verifier saying that it exceeds 1M instructions, even though it shouldn't. The program finds the hostname of a HTTP packet.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
struct server_name {
char server_name[256];
__u16 length;
};
#define MAX_SERVER_NAME_LENGTH 253
#define HEADER_LEN 6
SEC("xdp")
int collect_ips_prog(struct xdp_md *ctx) {
char *data_end = (char *)(long)ctx->data_end;
char *data = (char *)(long)ctx->data;
int host_header_found = 0;
for (__u16 i = 0; i <= 512 - HEADER_LEN; i++) {
host_header_found = 0;
if (data_end < data + HEADER_LEN) {
goto end;
}
// Elf loader does not allow NULL terminated strings, so have to check each char manually
if (data[0] == 'H' && data[1] == 'o' && data[2] == 's' && data[3] == 't' && data[4] == ':' && data[5] == ' ') {
host_header_found = 1;
data += HEADER_LEN;
break;
}
data++;
}
if (host_header_found) {
struct server_name sn = {"a", 0};
for (__u16 j = 0; j < MAX_SERVER_NAME_LENGTH; j++) {
if (data_end < data + 1) {
goto end;
}
if (*data == '\r') {
break;
}
sn.server_name[j] = *data++;
sn.length++;
}
}
end:
return XDP_PASS;
}
Ignore that data is not pointing to the beginning of the HTTP payload of a packet. This is enough to reproduce the problem I'm seeing.
I get the following error:
; for (__u16 j = 0; j < MAX_SERVER_NAME_LENGTH; j++) {
76: (25) if r3 > 0xfb goto pc+3
77: (07) r3 += 1
78: (07) r4 += 8
79: (3d) if r1 >= r4 goto pc-15
from 79 to 65: R0_w=fp-189 R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=280,r=363,imm=0) R3_w=invP76 R4_w=pkt(id=0,off=363,r=363,imm=0) R5_w=inv(id=0,umin_value=1,umax_value=65536,var_off=(0x0; 0x1ffff)) R10=fp0 fp-8=??????mm fp-16=00000000 fp-24=00000000 fp-32=00000000 fp-40=00000000 fp-48=00000000 fp-56=00000000 fp-64=00000000 fp-72=00000000 fp-80=00000000 fp-88=00000000 fp-96=00000000 fp-104=00000000 fp-112=00000000 fp-120=00000000 fp-128=00000000 fp-136=00000000 fp-144=00000000 fp-152=00000000 fp-160=00000000 fp-168=00000000 fp-176=00000000 fp-184=00000000 fp-192=0000mmmm fp-200=mmmmmmmm fp-208=mmmmmmmm fp-216=mmmmmmmm fp-224=mmmmmmmm fp-232=mmmmmmmm fp-240=mmmmmmmm fp-248=mmmmmmmm fp-256=mmmmmmmm fp-264=mmmmmmmm
; if (*data == '\r') {
65: (bf) r4 = r2
66: (0f) r4 += r3
67: (71) r5 = *(u8 *)(r4 +6)
BPF program is too large. Processed 1000001 insn
processed 1000001 insns (limit 1000000) max_states_per_insn 34 total_states 10376 peak_states 7503 mark_read 3
This doesn't make sense because there should be at most 20 instructions in the second for loop, which would make for a max of 5060 instructions if the max iterations are reached. The smallest value I can decrease MAX_SERVER_NAME_LENGTH to for the verifier to pass is 104. If I comment out the if (host_header_found) { block, then the verifier succeeds.

TL;DR. Your program is too complex for the verifier to analyze, as it must iterate over more than 1 million instructions to verify the full program.
Verifier Error Explanation
BPF program is too large. Processed 1000001 insn
The verifier errors because it already analyzed 1 million instructions. Hence it reached the limit and is giving up.
This verifier error is indeed a bit misleading. The BPF program isn't actually too large. The number of instructions the verifier has to analyze is different from the number of instructions in the whole program because the verifier has to analyze each and every path through the program. Therefore, it may analyze the same instruction several times, along different paths.
How can such a small program require more than 1M analyzed instructions?
The verifier reached 1 million instructions because your program has a lot of different paths. Indeed, your program has two loops with fairly high bounds (506 and 253), which themselves contain several conditions (to simplify, ~2 in each). In the worst case, the verifier may have to analyze each instruction on all possible paths through those two loops.
How can I fix it?
You may reduce the size of the loop (as you figured) to reduce the complexity. You may also simplify the loop bodies.
Alternatively, you could break your program with tail calls. Maybe one tail call between the two loops would be enough to pass the verifier.

Related

APM32 C Copy & Execute function in RAM

hi i am using an APM32F003 with Keil uVision compiler.
is a little known microcontroller but compatible with STM32.
I would like to write functions in RAM for different purposes.
I don't want to use the linker attribute to assign the function in ram,
but I want to copy a written one in flash and transfer it in RAM in run-time.
below the code I am trying to write but for now it is not working.
I think it's not possible in this way right?
static volatile uint8_t m_buffer_ram[100];
void flash_function()
{
/* Example */
LED2_ON();
}
void flash_function_end()
{
}
void call_function_in_ram()
{
uint32_t size = (uint32_t) flash_function_end - (uint32_t) flash_function;
/* clone function in RAM */
for (uint32_t i = 0; i < size; i++)
m_buffer_ram[i] = (((uint8_t*)&flash_function)[i]);
__disable_irq();
/* cast buffer to function pointer */
void(*func_ptr)(void) = (void(*)(void)) (&m_buffer_ram);
/* call function in ram */
func_ptr();
__enable_irq();
}
Eugene asked if your function is relocatable. This is very important. I have had issues in the past wherein I copied a function from flash to RAM, and the compiler used an absolute address in the "flash" based function. Therefore the code which was running in RAM jumped back into the flash. This is just one example of what might go wrong with moving code which is not relocatable.
If you have a debugger that can disassemble and also step through the compiled code for you, that would be ideal.
Note also "the busybee" pointed out that code which is adjacent in source code does is not guaranteed to be adjacent in the compiled binary, so your method of finding the size of the code is not reliable.
You can look in the map file to determine the size of the function.
I agree with the comment that you would be better off learning to have the linker do the work for you.
None of what I am saying here is new; I am just reinforcing the comments made above.
CODE
static volatile uint8_t m_buffer_ram[200];
static uint32_t m_function_size;
void flash_function(void)
{
LED2_ON();
}
void flash_function_end(void)
{
}
void test(void)
{
m_function_size = (uint32_t) flash_function_end - (uint32_t) flash_function;
/* clone function in RAM */
for (uint16_t i = 0; i < m_function_size; i++)
m_buffer_ram[i] = (((uint8_t*)&flash_function)[i]);
__disable_irq();
/* cast buffer to function pointer, +1 Thumb Code */
void(*func_ptr)(void) = (void(*)(void)) (&m_buffer_ram[1]);
/* call function in ram */
func_ptr();
__enable_irq();
}
MAP
Image Symbol Table
Symbol Name Value Ov Type Size Object(Section)
Local Symbols
.....
m_function_size 0x20000024 Data 4 test.o(.data)
m_buffer_ram 0x200001f0 Data 200 test.o(.bss)
Global Symbols
.....
flash_function 0x00000399 Thumb Code 12 test.o(i.flash_function)
flash_function_end 0x000003a9 Thumb Code 2 test.o(i.flash_function_end)
Memory Map of the image
Exec Addr Load Addr Size Type Attr Idx E Section Name Object
.....
0x00000398 0x00000398 0x00000010 Code RO 355 i.flash_function test.o
0x000003a8 0x000003a8 0x00000002 Code RO 356 i.flash_function_end test.o
DISASSEMBLE
.....
30: m_function_size = (uint32_t) flash_function_end - (uint32_t) flash_function;
31:
0x00000462 480D LDR r0,[pc,#52] ; #0x00000498
0x00000464 4A0D LDR r2,[pc,#52] ; #0x0000049C
0x00000466 4C0E LDR r4,[pc,#56] ; #0x000004A0
0x00000468 1A81 SUBS r1,r0,r2
0x0000046A 6021 STR r1,[r4,#0x00]
32: for (uint16_t i = 0; i < m_function_size; i++)
0x0000046C 2000 MOVS r0,#0x00
33: m_buffer_ram[i] = (((uint8_t*)&flash_function)[i]);
34:
0x0000046E 4B0D LDR r3,[pc,#52] ; #0x000004A4
0x00000470 2900 CMP r1,#0x00
0x00000472 D905 BLS 0x00000480
33: m_buffer_ram[i] = (((uint8_t*)&flash_function)[i]);
0x00000474 5C15 LDRB r5,[r2,r0]
0x00000476 541D STRB r5,[r3,r0]
32: for (uint16_t i = 0; i < m_function_size; i++)
0x00000478 1C40 ADDS r0,r0,#1
0x0000047A B280 UXTH r0,r0
32: for (uint16_t i = 0; i < m_function_size; i++)
0x0000047C 4288 CMP r0,r1
0x0000047E D3F9 BCC 0x00000474
35: __disable_irq();
36:
0x00000480 B672 CPSID I
37: void(*func_ptr)(void) = (void(*)(void)) (&m_buffer_ram[1]);
0x00000482 1C5B ADDS r3,r3,#1
38: func_ptr();
39:
0x00000484 4798 BLX r3
40: __enable_irq();
41:
0x00000486 B662 CPSIE I
I report all the information that I was able to recover.
I added a shift for the Thumb Code; the calculation of the function size coincides with the MAP file
my doubt is that in debug the pointer cannot jump to a point of the RAM .. for this reason I activate a led to see if (flashing code and run) this turns on without debugging.
as reported below, the read values coincide
(0x000003a8)flash_function_end - (0x00000398)flash_function = 0x10
(0x20000024)m_function_size = 0x10
func_ptr = 0x200001f1;

How can I make this loop run faster?

I'm using this code to find the highest temperature pixel in a thermal image and the coordinates of the pixel.
void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
int temp = 0;
for (int i = sz; i > 0; i--)
{
if (returnPixel->temperature < *image)
{
returnPixel->temperature = *image;
temp = i;
}
image++;
}
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}
With an image size of 640x480 it takes around 35ms to run through this function, which is too slow for what I need it for (under 10ms ideally).
This is executing on an ARM A9 processor running Linux.
The compiler I'm using is ARM v8 32-Bit Linux gcc compiler.
I'm using optimize -O3 and the following compile options: -march=armv7-a+neon -mcpu=cortex-a9 -mfpu=neon-fp16 -ftree-vectorize.
This is the output from the compiler:
000127f4 <_findMax>:
for(int i = sz; i > 0; i--)
127f4: e3510000 cmp r1, #0
{
127f8: e52de004 push {lr} ; (str lr, [sp, #-4]!)
for(int i = sz; i > 0; i--)
127fc: da000014 ble 12854 <_findMax+0x60>
12800: e1d2c0b0 ldrh ip, [r2]
12804: e2400002 sub r0, r0, #2
int temp = 0;
12808: e3a0e000 mov lr, #0
if(returnPixel->temperature < *image)
1280c: e1f030b2 ldrh r3, [r0, #2]!
12810: e153000c cmp r3, ip
returnPixel->temperature = *image;
12814: 81a0c003 movhi ip, r3
12818: 81a0e001 movhi lr, r1
1281c: 81c230b0 strhhi r3, [r2]
for(int i = sz; i > 0; i--)
12820: e2511001 subs r1, r1, #1
12824: 1afffff8 bne 1280c <_findMax+0x18>
12828: e30c3ccd movw r3, #52429 ; 0xcccd
1282c: e34c3ccc movt r3, #52428 ; 0xcccc
12830: e0831e93 umull r1, r3, r3, lr
12834: e1a034a3 lsr r3, r3, #9
12838: e0831103 add r1, r3, r3, lsl #2
1283c: e6ff3073 uxth r3, r3
12840: e04ee381 sub lr, lr, r1, lsl #7
12844: e6ffe07e uxth lr, lr
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
12848: e1c2e0b4 strh lr, [r2, #4]
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
1284c: e1c230b6 strh r3, [r2, #6]
}
12850: e49df004 pop {pc} ; (ldr pc, [sp], #4)
for(int i = sz; i > 0; i--)
12854: e3a03000 mov r3, #0
12858: e1a0e003 mov lr, r3
1285c: eafffff9 b 12848 <_findMax+0x54>
For clarity after comments:
Each pixel is a unsigned 16 bit integer, image[0] would be the pixel with coordinates 0,0, and the last in the array would have the coordinates 639,479.
This is executing on an ARM A9 processor running Linux.
ARM Cortex-A9 supports Neon.
With this in mind the goal should be to load 8 values (128 bits of pixel data) into a register, then do "compare with the current maximums for each of the 8 places" to get a mask, then use the mask and its inverse to mask out the "too small" old maximums and the "too small" new values; then OR the results to merge the new higher values into the "current maximums for each of the 8 places".
Once that has been done for all pixels (using a loop); you'd want to find the highest value in the "current maximums for each of the 8 places".
However; to find the location of the hottest pixel (rather than just how hot it is) you'd want to split the image into tiles (e.g. maybe 8 pixels wide and 8 pixels tall). This allows you to find the max. temperature within each tile (using Neon); then find the pixel within the hottest tile. Note that for huge images this lends itself to a "multi-layer" approach - e.g. create a smaller image containing the maximum from each tile in the original image; then do the same thing again to create an even smaller image containing the maximum from each "group of tiles", then ...
Making this work in plain C means trying to convince the compiler to auto-vectorize. The alternatives are to use compiler intrinsics or inline assembly. In any of these cases, using Neon to do 8 pixels in parallel (without any branches) could/should improve performance significantly (how much depends on RAM bandwidth).
You should minimize memory access, especially in loops.
Every * or -> could cause unnecessary memory access resulting in serious performance hits.
Local variables are your best friend:
void _findMax(uint16_t *image, int sz, sPixelData *returnPixel)
{
int temp = 0;
uint16_t temperature = returnPixel->temperature;
uint16_t pixel;
for (int i = sz; i > 0; i--)
{
pixel = *image++;
if (temperature < pixel)
{
temperature = pixel;
temp = i;
}
}
returnPixel->temperature = temperature;
returnPixel->x_location = temp % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = temp / IMAGE_HORIZONTAL_SIZE;
}
Below is how this can be optimized by utilizing neon:
#include <stdint.h>
#include <arm_neon.h>
#include <assert.h>
static inline void findMax128_neon(uint16_t *pDst, uint16x8_t *pImage)
{
uint16x8_t in0, in1, in2, in3, in4, in5, in6, in7, in8, in9, in10, in11, in12, in13, in14, in15;
uint16x4_t dmax;
in0 = vld1q_u16(pImage++);
in1 = vld1q_u16(pImage++);
in2 = vld1q_u16(pImage++);
in3 = vld1q_u16(pImage++);
in4 = vld1q_u16(pImage++);
in5 = vld1q_u16(pImage++);
in6 = vld1q_u16(pImage++);
in7 = vld1q_u16(pImage++);
in8 = vld1q_u16(pImage++);
in9 = vld1q_u16(pImage++);
in10 = vld1q_u16(pImage++);
in11 = vld1q_u16(pImage++);
in12 = vld1q_u16(pImage++);
in13 = vld1q_u16(pImage++);
in14 = vld1q_u16(pImage++);
in15 = vld1q_u16(pImage);
in0 = vmaxq_u16(in1, in0);
in2 = vmaxq_u16(in3, in2);
in4 = vmaxq_u16(in5, in4);
in6 = vmaxq_u16(in7, in6);
in8 = vmaxq_u16(in9, in8);
in10 = vmaxq_u16(in11, in10);
in12 = vmaxq_u16(in13, in12);
in14 = vmaxq_u16(in15, in14);
in0 = vmaxq_u16(in2, in0);
in4 = vmaxq_u16(in6, in4);
in8 = vmaxq_u16(in10, in8);
in12 = vmaxq_u16(in14, in12);
in0 = vmaxq_u16(in4, in0);
in8 = vmaxq_u16(in12, in8);
in0 = vmaxq_u16(in8, in0);
dmax = vmax_u16(vget_high_u16(in0), vget_low_u16(in0));
dmax = vpmax_u16(dmax, dmax);
dmax = vpmax_u16(dmax, dmax);
vst1_lane_u16(pDst, dmax, 0);
}
void _findMax_neon(uint16_t *image, int sz, sPixelData *returnPixel)
{
assert((sz % 128) == 0);
const uint32_t nSector = sz/128;
uint16_t max[nSector];
uint32_t i, s, nMax;
uint16_t *pImage;
for (i = 0; i < nSector; ++i)
{
findMax128_neon(&max[i], (uint16x8_t *) &image[i*128]);
}
s = 0;
nMax = max[0];
for (i = 1; i < nSector; ++i)
{
if (max[i] > nMax)
{
s = i;
nMax = max[i];
}
}
if (nMax < returnPixel->temperature)
{
returnPixel->x_location = 0;
returnPixel->y_location = 0;
return;
}
pImage = &image[s];
i = 0;
while(1) {
if (*pImage++ == nMax) break;
i += 1;
}
i += 128 * s;
returnPixel->temperature = nMax;
returnPixel->x_location = i % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = i / IMAGE_HORIZONTAL_SIZE;
}
Beware that the function above assumes sz being a multiple of 128.
And yes, it will run in less than 10ms.
The culprit here is the slow linear search for the highest "temperature". I'm not quite sure how to improve that search algorithm with the information given, if at all possible (could you sort the data in advance?), but you could start with this:
uint16_t max = 0;
size_t found_index = 0;
for(size_t i=0; i<sz; i++)
{
if(max < image[i])
{
max = image[i];
found_index = sz - i - 1; // or whatever makes sense here for the algorithm
}
}
returnPixel->temperature = max;
returnPixel->x_location = found_index % IMAGE_HORIZONTAL_SIZE;
returnPixel->y_location = found_index / IMAGE_HORIZONTAL_SIZE;
This might give a very slight performance gain because of the top to bottom iteration order and not touching unrelated memory returnPixel in the middle of the loop. max should get stored in a register and with luck you might get slightly better cache performance overall. Still, this comes with a branch like the original code, so it is a minor improvement.
Another micro-optimization is to change the parameter to const uint16_t* image - this might give slightly better pointer aliasing, in case returnPixel happens to contain a uint16_t too. image should be const regardless of performance, since const correctness is always good practice.
Further more obscure optimization tricks might be possible if you read image 32 or 64 bits at a time, then come up with a fast look-up method to find the largest image inside that 32/64 bit chunk.
If you have to find the hottest pixel in the image and if there is no structure to the image data itself then I think your stuck with iterating through the pixels. If so you have a number of different ways to make this faster:
As suggested above try loop unrolling and other micro-optimisation tricks, this might give you the performance boost you need
Go parallel, split the array into N chunks and find the MAX[N] for each chunk and then find the largest of the MAX[N] values. You have to be careful here as setting up the parallel processes can take longer than doing the work.
If there is some structure to the image, lots of cold pixels and a hot spot (larger than 1 pixel) that your trying to find say, then there are other techniques you could use.
One approach could be to split the image up into N boxes and then sample each box, The hottest pixel in the hottest box (and maybe in the boxes adjacent too it) would then be your result. However this depends on their being some structure to the image which you can rely on.
The assembly language reveals the compiler is storing in returnPixel->temperature each time a new maximum is found, with the instruction strhhi r3, [r2]. Eliminate this unnecessary store by caching the maximum in a local object and only updating returnPixel->temperature after the loop ends:
uint16_t maximum = returnPixel->temperature;
for (int i = sz; i > 0; i--)
{
if (maximum < *image)
{
maximum = *image;
temp = i;
}
image++;
}
returnPixel->temperature = maximum;
That is unlikely to reduce the execution time as much as you need, but it might if there is some bad cache or memory interaction occurring. It is a very simple change, so try it before moving on to the SIMD vectorizations suggested in other answers.
Regarding vectorization, two approaches are:
Iterate through the image using a vmax instruction to update the maximum value seen so far in each SIMD lane. Then consolidate the lanes to find the overall maximum. Then iterate through the image again looking for that maximum in any lane. (I forget what that architecture has for instructions that would assist in testing whether a comparison produced true in any lane.)
Iterate through the image maintaining three registers: One with the maximum seen so far in each lane, one with a counter of position in the image, and one with, in each lane, a record of the counter value at the time each new maximum was seen. The first can be updated with vmax, as above. The second can be updated with vadd. The third can be updated with vcmp and vbit. After the loop, figure out which lane has the maximum, and get the position of the maximum from the recorded counter for that lane.
Depending on the performance of the necessary instructions, a hybrid approach may be faster:
Set some strip size S. Partition the image into strips of that size. For each strip in the image, find the maximum (using the fast vmax loop described above). If the maximum is greater than seen in previous strips, remember it and the current strip number. After processing the whole image, the task has been reduced to finding the location of the maximum in a particular strip. Use the second loop from the first approach above for that. (For large images, further refinements may be possible, possibly refining the location using a shorter strip size before finding the exact location, depending on cache behavior and other factors.)
In my opinion you cannot improve that algorithim, because any array element can hold the maximum, so you need to do at least one pass through the data, I don't believe you can improve this without going multithreading. you can start several threads (as many as cores/processors you may have) and give each a subset of your image. Once they are finished, you will have as many local maximum values as the number of threads you started. just do a second pass on those vaues to get the maximum total value, and you are finished. But consider the extra workload of creating threads, allocating stack memory for them, and scheduling, as that can be higher if the number of values is low than the workload of running all in a single thread. If you have a thread pool somewhere to provide ready to run threads, and that is something you can get on, then probably you'll be able to finished in one Nth part of the time to run all the loop in one single processor (where N is the number of cores you have on the machine)
Note: using a dimension that is a power of two will save you the job of calculating a quotient and a remainder by solving the division problem with a bit shift and a bit mask. You use it only once in your function, but it's an improving, anyway.

Determine length in bytes of content of a variable

I want to automatically determine length in bytes of a field (addr) that is uint32, based on it's contents. Compiler is GCC. I use this:
uint8 len;
if(addr < 256) len = 1;
else if (addr < 65536) len = 2;
else if (addr < 16777216) len = 3;
else len = 4;
Is there a more efficient way?
This is inside a SPI function for a embedded device. I'm interested in the fastest way except macros, since addr can be a variable.
You can do it using an approach similar to binary search: first compare to 65536, then either to 256 or 16777216, depending on the outcome of the first comparison. This way you always finish in two comparisons, while your code sometimes would require three:
uint8 len = (addr < 65536)
? ((addr < 256) ? 1 : 2)
: ((addr < 16777216) ? 3 : 4);
gcc has a __builtin_ctz() function which can be used like
if (addr == 0)
len = 0;
else
len = (sizeof(int) * 8 - __builtin_ctz(addr) + 7) / 8;
Update:
under ARM, this compiles to
cmp r0, #0
rbitne r0, r0
clzne r0, r0
rsbne r0, r0, #39
lsrne r0, r0, #3
bx lr
Get the position of the highest bit – the only one that counts for your question (from https://stackoverflow.com/a/14085901). Divide by 8 to get the number of bytes.
addr = 0x20424;
printf ("%d\n", (fls(addr)+7)>>3);
It returns 0 when addr == 0.
fls is conforming to POSIX.1-2001, POSIX.1-2008, 4.3BSD. If your current system does not contain it, look at the above link or What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C? for more suggestions to find the highest bit set.

Why does storing a %x value to a variable do funky things when leading value is 8-f?

I am writing a program in c which imitates an LC-3 simulator. One of the objectives of this program is to store a 4 digit hexadecimal value from a file (0000 - ffff), and to convert it to binary, and interpret an LC-3 instruction from it. The following code segment shows how I am storing this value into a variable (which is where the problem seems to lie), and below that is the output I am receiving:
int *strstr(int s, char c);
void initialize_memory(int argc, char *argv[], CPU *cpu) {
FILE *datafile = get_datafile(argc, argv);
// Buffer to read next line of text into
#define DATA_BUFFER_LEN 256
char buffer[DATA_BUFFER_LEN];
int counter = 0;
// Will read the next line (words_read = 1 if it started
// with a memory value). Will set memory location loc to
// value_read
//
int value_read, words_read, loc = 0, done = 0;
char comment;
char *read_success; // NULL if reading in a line fails.
int commentLine =0;
read_success = fgets(buffer, DATA_BUFFER_LEN, datafile);
while (read_success != NULL && !done) {
// If the line of input begins with an integer, treat
// it as the memory value to read in. Ignore junk
// after the number and ignore blank lines and lines
// that don't begin with a number.
//
words_read = sscanf(buffer, "%04x%c", &value_read, &comment);
// if an integer was actually read in, then
// set memory value at current location to
// value_read and increment location. Exceptions: If
// loc is out of range, complain and quit the loop. If
// value_read is outside 0000 and ffff, then it's a
// sentinel -- we should say so and quit the loop.
if (value_read == NULL || comment ==';')
{
commentLine = 1;
}
if (value_read < -65536 || value_read > 65536)
{
printf("Sentinel read in place of Memory location %d: quitting loop\n", loc);
break;
}
else if (value_read >= -65536 && value_read <= 65536)
{
if (commentLine == 0)
{
if (counter == 0)
{
loc = value_read;
cpu -> memLocation = loc;
printf("\nPC location set to: x%04x \n\n", cpu -> memLocation);
counter++;
}
else
{
cpu -> mem[loc] = value_read;
printf("x%04x : x%d\t %04x \t ", loc,loc, cpu -> mem[loc]);
print_instr(cpu, cpu -> mem[loc]);
loc++;
value_read = NULL;
}
}
}
if (loc > 65536)
{
printf("Reached Memory limit, quitting loop.\n", loc);
break;
}
commentLine = 0;
read_success = fgets(buffer, DATA_BUFFER_LEN, datafile);
// Gets next line and continues the loop
}
fclose(datafile);
// Initialize rest of memory
while (loc < MEMLEN) {
cpu -> mem[loc++] = 0;
}
}
My aim is to show the Hex address : decimal address, the hex instruction, binary code, and then at the end, its LC-3 instruction translation. The data I am scanning from the file is the hex instruction:
x1000 : x4096 200c 0010000000001100 LD, R0, 12
x1001 : x4097 1221 0001001000100000 ADD, R1, R0, 0
x1002 : x4098 1401 0001010000000000 ADD, R2, R0, R0
x1003 : x4099 ffff94bf 0000000000000000 NOP
x1004 : x4100 166f 0001011001101110 ADD, R3, R1, 14
x1005 : x4101 1830 0001100000110000 ADD, R4, R0, -16
x1006 : x4102 1b04 0001101100000100 ADD, R5, R4, R4
x1007 : x4103 5d05 0101110100000100 AND, R6, R4, R4
x1008 : x4104 5e3f 0101111000111110 AND, R7, R0, -2
x1009 : x4105 5030 0101000000110000 AND, R0, R0, -16
x100a : x4106 52ef 0101001011101110 AND, R1, R3, 14
x100b : x4107 5fe0 0101111111100000 AND, R7, R7, 0
x100c : x4108 fffff025 0000000000000000 NOP
x100d : x4109 7fff 0111111111111110 STR, R7, R7, -2
As you can see, my problem lies in addresses x1003 and x100c;
As stated in the headline, when storing the hex instruction, if the value is between 8 and f, my best guess is that the scan is interpreting it as a negative value because of the leading value of the first hex digit in binary. If that is the case, it makes perfect sense, but is there a way I can bypass this? And if it isn't the case, what else could be causing this?
I found that if I pass value_read into print_instr() instead of cpu -> mem[loc], then the output works correctly. However, this is only a temporary fix as I need to store that value for later use in the program(for actual execution of the instruction). So the problem seems to arise while storing, and I am unsure as to why.
Additionally, (and this is a side question) though it is not a pressing concern, since I am using %x%c (value_read, comment) to store values from the file, I have been having trouble with the first few lines of the .hex file I am using, in which there is no hex value in the line, but instead just a comment symbol (for those unfamiliar with lc_3 simulators, the ';' is the symbol for comments). Whenever this is the case, I get a hex value of zero, although I wish for it to be NULL(In my program, I implemented a temporary solution because I am not sure how to fix it). I am not an expert in c just yet, and have not been able to find a solution to this problem. If you can help, it would be greatly appreciated, otherwise, it isn't a big issue for what I am trying to achieve with this program, it is more so just for my own knowledge and growth.
Thank you all in advance for your help :)
In a scanf family format string, the %x specifier means to read into an unsigned int. The corresponding argument must have exactly the type unsigned int *.
However you supply an argument of type int *.
This causes undefined behaviour. What you are seeing is the chance interaction between library elements that expect you to follow the rules, and your code that didn't follow the rules.
To fix it, follow the rules. For example, read into an unsigned int variable.
NB. 0 does nothing in the scanf format string; %04x is equivalent to %4x.
May I suppose that cpu->mem is of type array of short or alike? Then sign extension occurs when printing cpu->mem[loc]. Remind that arguments are at least converted to int at printf calls. Symptom is the same as in the following code:
int i;
scanf("%4x",&i);
printf("%x\n",i);
short s = i;
printf("--> %x\n",s);
The short equals to -1 then when you set it to an int it is converted to -1, 0xffffffff (if 32-bits).
Use unsigned short in place.

Few extremely large values when using RDTSCP in C

I am using the following function to measure clock cycles taken by a simple operation:
uint64_t rdtscp(void) {
uint32_t lo, hi;
__asm__ volatile ("rdtscp"
: /* outputs */ "=a" (lo), "=d" (hi)
: /* no inputs */
: /* clobbers */ "%rcx");
return (uint64_t)lo | (((uint64_t)hi) << 32);
}
To avoid taking into account possible delay due to interrupts, I parsed /proc/interrupts file in the read_intr() function and considered the value only if there were no new interrupts on CPU 0.
These functions are used to measure the cycles taken by the operation in a loop:
for (i = 0; i < 1000; i++)
{
for (j = 0; j < 256; j++)
{
rnd1 = rand() % 256;
intr1 = read_intr();
start = rdtscp();
rnd1 + 1; // Operation
end = rdtscp();
intr2 = read_intr();
if (intr1 == intr2)
{
printf("%lu\n", end - start);
fflush(stdout);
}
}
}
I then ran the above program using:
$ sudo schedtool -F -p 99 -e taskset 0x1 ./a.out > output
That is, the program is scheduled using SCHED_FIFO with 99 priority and forced to run on CPU 0.
I analyzed the output counter values to obtain the following statistical parameters about the execution cycles taken:
Mean = 36.49
Median = 33
However, when the output is sorted in descending order the first few values are:
32593
29081
28878
1113
1065
612
What is the reason for these (very few) extremely high counter readings even after interrupts are being considered? Also, given that the first three sorted readings are much higher than the following three (shown), could this be because of different reasons?
Edit: Based on gudok's comment, I turned off hyper-threading and also automatic power management from the BIOS settings, but still find a discrepancy as mentioned below.

Resources