After compiling some code, the compiler generates a bunch of files. I have statistics, symbols, call tree, errors, list, debug and exe. I have figured out what each means, except for the list file. What is the function of the list file? Is it for the user, or for the computer/embedded system itself?
The exact contents of the list file vary slightly by the tool and chip being used.
The major part of the file will be the translation of the C source code into the assembly instructions that has been performed by the compiler. This is useful for debugging the code and for checking how efficiently the compiler translates certain source code constructs. In the example below, each C line is given a line number and the generated assembler is listed after it (this example is for the AVR32 processor).
171 /**********************************************************
172 * Test for a receive interrupt
173 **********************************************************/
174 if ( USART_CHANNEL[ Channel ] -> CSR.rxrdy )
000008 F8051502 LSL R5,R12,0x2
00000C ........ MOV R7,LWRD(USART_CHANNEL)
000010 EA17.... ORH R7,HWRD(USART_CHANNEL)
000014 EE0C0027 ADD R7,R7,R12<<0x2
000018 6E0C LD.w R12,R7[0x0]
00001A ........ MOV R6,LWRD(Serial_Receive_Queue)
00001E EA16.... ORH R6,HWRD(Serial_Receive_Queue)
000022 785B LD.w R11,R12[0x14]
000024 A19B LSR R11,0x1
000026 C0B2 BRCC ??USART_Process_Interrupt_1:C
The hex values that are shown as "...." above are addresses that are not known at compile time; they are symbols that will be resolved at link time.
The list file will also typically give some statistics regarding the code size, RAM requirements and stack usage for the module being compiled. Again, this is from the IAR toolset for the AVR32:
Maximum stack usage in bytes:
Function CSTACK
-------- ------
Serial_Ports_Initialise 36
-> gpio_enable_module 36
-> usart_init_rs232 36
-> Indirect call 36
-> Indirect call 36
-> Indirect call 36
-> Indirect call 36
Serial_Transmit_With_Length 20
-> xQueueGenericSend 20
-> vTaskDelay 20
USART0_INT_Handler 0
-> USART_Process_Interrupt 0
USART1_INT_Handler 0
-> USART_Process_Interrupt 0
USART2_INT_Handler 0
-> USART_Process_Interrupt 0
USART_Process_Interrupt 32
-> xQueueGenericSendFromISR 32
-> xQueueReceiveFromISR 32
Segment part sizes:
Function/Label Bytes
-------------- -----
Serial_Receive_Queue 24
Serial_Transmit_Queue
USART_CHANNEL 12
USART0_INT_Handler 8
USART1_INT_Handler 8
USART2_INT_Handler 12
USART_Process_Interrupt 112
Serial_Ports_Initialise 172
USART_Channel_In_Use 56
USART_GPIO_MAP
USART_OPTIONS
Serial_Transmit_With_Length 116
?<Initializer for USART_CHANNEL> 12
??USART1_INT_Handler??handle 4
Others 24
400 bytes in segment CODE32
56 bytes in segment DATA32_C
12 bytes in segment DATA32_I
12 bytes in segment DATA32_ID
24 bytes in segment DATA32_Z
28 bytes in segment EVSEG
4 bytes in segment HTAB
24 bytes in segment INITTAB
400 bytes of CODE memory
100 bytes of CONST memory (+ 24 bytes shared)
36 bytes of DATA memory
Errors: none
Warnings: 1
Any error messages or warnings generated will also be inserted at the relevant lines of code.
The list file can therefore be used as an aid to estimating stack and memory usage (although stack usage is a highly intractable problem in any embedded system) and as a way to see the assembler-level code produced by the compiler.
From experience, the list file is not particularly useful when using a source-level debugging tool, since such a tool generally shows the relevant disassembled code directly.
A list file (.LST) contains a block of C code [commented out by a sequence of period characters] followed by the assembly code for that block.
For example:
.................... return FALSE;
0046: MOVLW 00
0047: MOVWF 21
0048: GOTO 049
I have been trying a few experiments on x86 - namely the effect of mfence on store/load latencies, etc.
Here is what I have started with:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#define ARRAY_SIZE 10
#define DUMMY_LOOP_CNT 1000000
int main()
{
    char array[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++)
        array[i] = 'x'; // This is to force the OS to actually allocate the array
    asm volatile ("mfence\n");
    for (int i = 0; i < DUMMY_LOOP_CNT; i++); // A dummy loop to just warm up the processor

    struct result_tuple {
        uint64_t tsp_start;
        uint64_t tsp_end;
        int offset;
    };
    struct result_tuple* results = calloc(ARRAY_SIZE, sizeof(struct result_tuple));

    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        uint64_t *tsp_start, *tsp_end;
        tsp_start = &results[i].tsp_start;
        tsp_end = &results[i].tsp_end;
        results[i].offset = i;
        asm volatile (
            "mfence\n"
            "rdtscp\n"
            "mov %%rdx,%[arg]\n"
            "shl $32,%[arg]\n"
            "or %%rax,%[arg]\n"
            : [arg]"=&r"(*tsp_start)
            :: "rax", "rdx", "rcx", "memory"
        );
        array[i] = 'y'; // A simple store
        asm volatile (
            "mfence\n"
            "rdtscp\n"
            "mov %%rdx,%[arg]\n"
            "shl $32,%[arg]\n"
            "or %%rax,%[arg]\n"
            : [arg]"=&r"(*tsp_end)
            :: "rax", "rdx", "rcx", "memory"
        );
    }

    printf("Offset\tLatency\n");
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        printf("%d\t%lu\n", results[i].offset, results[i].tsp_end - results[i].tsp_start);
    }
    free(results);
}
I compile quite simply with gcc microbenchmark.c -o microbenchmark
My system configuration is as follows:
CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Operating system : GNU/Linux (Linux 5.4.80-2)
My issue is this:
In a single run, all the latencies are similar
When repeating the experiment over and over, I don't get results similar to the previous run!
For instance:
In run 1 I get:
Offset Latency
1 275
2 262
3 262
4 262
5 275
...
252 275
253 275
254 262
255 262
In another run I get:
Offset Latency
1 75
2 75
3 75
4 72
5 72
...
251 72
252 72
253 75
254 75
255 72
This is pretty surprising (the among-run variation is quite high, whereas there is negligible within-run variation)! I am not sure how to explain it. What is the issue with my microbenchmark?
Note: I do understand that a normal store would be a write-allocate store, technically making my measurement that of a load (rather than a store). Also, mfence should flush the store buffer, thereby ensuring that no stores are 'delayed'.
Your warm-up dummy loop only does 1 million iterations, ~6 million clock cycles in a -O0 debug build - probably not long enough to get the CPU up to max turbo, on a CPU before Skylake's hardware P-state management. (Idiomatic way of performance evaluation?)
RDTSCP counts fixed-frequency reference cycles, not core clock cycles. Your runs are so short that all the run-to-run variation is probably explained by the CPU frequency being low or high. See How to get the CPU cycle count in x86_64 from C++?
Also, this debug (-O0) build will do extra stores and reloads inside your timed region, but "fortunately" the results[i].offset = i; store plus the mfence before the first rdtscp ensures the result array is also hot in cache before entering the timed region.
Your array is tiny, and you're only doing 1-byte stores (so 64 stores are all in the same cache line.) It's very likely still in MESI Modified state from when you initialized it, so I wouldn't expect an RFO on any of the array[i] = 'y' stores. That already happened for the few lines of stack memory involved before your timed loop. If you want to pre-fault the array without also getting it cached, maybe touch one line per 4k page and leave the other lines untouched. But HW prefetch will get ahead of your stores, especially if you only store 1 byte at a time with 2 slow mfences per store, so again the waiting for off-core memory requests will be outside the timed region. You should expect data to already be in L1d cache or at least L2 in Exclusive state, ready to be flipped to Modified on a store.
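A hypothetical sketch of that pre-faulting idea (my illustration, not the question's code; prefault_pages is a made-up name, and it assumes a large heap buffer rather than the tiny 10-byte array above):
#include <stdlib.h>

// Write one byte per 4 KiB page so the OS backs every page, while
// leaving almost all cache lines untouched (so stores into them can
// still miss in cache if that's what you want to measure).
char *prefault_pages(size_t size)
{
    char *buf = malloc(size);
    for (size_t off = 0; off < size; off += 4096)
        buf[off] = 0;   // one store per page: faults the page in
    return buf;
}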
BTW, having an offset member seems pointless; it can be implicit from the array index. e.g. print i instead of offset[i]. It's also not very useful to store both start and stop absolute TSC values. You could just store a 32-bit difference, then you wouldn't need to shift / OR in your inline asm, just declare a clobber on the unused EDX output.
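A minimal sketch of that simplification (my own illustration, under the assumption that the timed interval stays well under 2^32 reference cycles): keep only the low 32 bits of the TSC and declare the registers rdtscp writes as clobbered.
#include <stdint.h>

// rdtscp writes EDX:EAX (the TSC) and ECX (the processor ID). Taking
// only EAX avoids the shift/OR; EDX and ECX are declared clobbered.
// Unsigned subtraction handles a single 32-bit wrap correctly.
static inline uint32_t rdtscp_lo(void)
{
    uint32_t lo;
    asm volatile("rdtscp" : "=a"(lo) : : "edx", "ecx", "memory");
    return lo;
}

// usage: uint32_t t0 = rdtscp_lo(); ... ; uint32_t dt = rdtscp_lo() - t0;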
Also note that "store latency" typically only matters for performance in real code when mfence is involved. Otherwise the important thing is store->load forwarding, which can happen from the store buffer before the store commits to L1d cache. That's about 6 cycles, or sometimes lower if the reload isn't attempted right away. (It's variable on Sandybridge-family.)
tl;dr: I'm trying to dynamically execute some code from another snippet, but I am stuck on handling memory references (e.g. mov $0x40200b,%rdi): can I patch my code, or the snippet-running code, so that 0x40200b is resolved correctly (as offset 0x200b from the start of the code)?
To generate the code to be executed dynamically, I start from a (kernel) object file and resolve the references using ld.
#!/usr/bin/python
import os, subprocess

if os.geteuid() != 0:
    print('Run this as root')
    exit(-1)

with open("/proc/kallsyms", "r") as f:
    out = f.read()

sym = subprocess.Popen(['nm', 'ebbchar.ko', '-u', '--demangle', '-fposix'],
                       stdout=subprocess.PIPE)
v = ''
for sym in sym.stdout:
    s = " " + sym.split()[0] + "\n"
    off = out.find(s)
    v += "--defsym " + s.strip() + "=0x" + out[off-18:off-2] + " "
print(v)
os.system("ld ebbchar.ko " + v + "-o ebbchar.bin")
I then transmit the code to be executed through an mmapped file:
int fd = open(argv[1], O_RDWR | O_SYNC);
address1 = mmap(NULL, page_size, PROT_WRITE|PROT_READ , MAP_SHARED, fd, 0);
int in=open(argv[2],O_RDONLY);
sz= read(in, buf+8,BUFFER_SIZE-8);
uint64_t entrypoint=atol(argv[3]);
*((uint64_t*)buf)=entrypoint;
write(fd, buf, min(sz+8, (size_t) BUFFER_SIZE));
I execute the code dynamically with this code:
struct mmap_info *info;
copy_from_user((void*)(&info->offset),buf,8);
copy_from_user(info->data, buf+8, sz-8);
unsigned long (*func)(void) = (void*) (info->data + info->offset);
unsigned long ret = func();
This approach works for code that doesn't access memory, such as "\x55\x48\x89\xe5\xc7\x45\xf8\x02\x00\x00\x00\xc7\x45\xfc\x03\x00\x00\x00\x8b\x55\xf8\x8b\x45\xfc\x01\xd0\x5d\xc3", but I have problems when memory is involved.
See the example below.
Let's assume I want to dynamically execute the function vm_close. objdump -d -S returns:
0000000000401017 <vm_close>:
{
401017: e8 e4 07 40 81 callq ffffffff81801800 <__fentry__>
printk(KERN_INFO "vm_close");
40101c: 48 c7 c7 0b 20 40 00 mov $0x40200b,%rdi
401023: e9 b6 63 ce 80 jmpq ffffffff810e73de <printk>
At execution, my function pointer points to the right code:
(gdb) x/12x $rip
0xffffc90000c0601c: 0x48 0xc7 0xc7 0x0b 0x20 0x40 0x00 0xe9
0xffffc90000c06024: 0xb6 0x63 0xce 0x80
(gdb) x/2i $rip
=> 0xffffc90000c0601c: mov $0x40200b,%rdi
0xffffc90000c06023: jmpq 0xffffc8ff818ec3de
BUT, this code will fail since:
1) In my context, $0x40200b points at the absolute address 0x40200b, and not at offset 0x200b from the beginning of the code.
2) I don't understand why, but the address displayed there is actually different from the correct one (0xffffc8ff818ec3de != 0xffffffff810e73de), so it won't point at my symbol and will crash.
Is there a way to solve my 2 issues?
Also, I had trouble finding good documentation related to my issue (low-level memory resolution); if you could point me to some, that would really help.
Edit: Since I run the code in the kernel, I cannot simply compile it with -fPIC or -fpie, which is not allowed by gcc (cc1: error: code model kernel does not support PIC mode).
Edit 24/09:
Following @Peter Cordes' comment, I recompiled it adding -mcmodel=small -fpie -mno-red-zone -mno-sse to the Makefile (/lib/modules/$(uname -r)fixed/build/Makefile).
This is better than the original version, since the generated code before linking is now:
0000000000000018 <vm_close>:
{
18: ff 15 00 00 00 00 callq *0x0(%rip) # 1e <vm_close+0x6>
printk(KERN_INFO "vm_close");
1e: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 25 <vm_close+0xd>
25: e8 00 00 00 00 callq 2a <vm_close+0x12>
}
2a: c3 retq
So, thanks to RIP-relative addressing, I'm now able to access the other variables in my script. Thus, after linking, I can successfully access my variable embedded within the buffer:
40101e: 48 8d 3d e6 0f 00 00 lea 0xfe6(%rip),%rdi # 40200b
Still, one problem remains:
The symbol I want to access (printk) and my executable buffer live in widely separated address ranges, for example:
printk=0xffffffff810e73de:
Executable_buffer=0xffffc9000099d000
But in my callq to printk, I only have 32 bits in which to encode the address to call, as an offset from $rip, since there is no .got section in the kernel. This means that printk has to be located within [$rip - 2 GiB, $rip + 2 GiB], which is not the case here.
Do I have a way to access the printk address although it is located more than 2 GiB away from my buffer (I tried mcmodel=medium but saw no difference in the generated code), for instance by modifying gcc options so that the binary actually has a .got section?
Or is there a reliable way to force my executable, potentially too-large-for-kmalloc buffer to be allocated in the [0xffffffff00000000 ; 0xffffffffffffffff] range? (I currently use __vmalloc(BUFFER_SIZE, GFP_KERNEL, PAGE_KERNEL_EXEC);)
Edit 27/09:
I succeeded in allocating my buffer in the [0xffffffff00000000 ; 0xffffffffffffffff] range by using the non-exported __vmalloc_node_range function as a (dirty) hack.
IMPORTED(__vmalloc_node_range)(BUFFER_SIZE, MODULE_ALIGN,
MODULES_VADDR + get_module_load_offset(),
MODULES_END, GFP_KERNEL,
PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
__builtin_return_address(0));
Then, when I know the address of my executable buffer and the addresses of the kernel symbols (by parsing /proc/kallsyms), I can patch my binary using ld's option --defsym symbol=relative_address, where relative_address = symbol_address - buffer_offset.
Despite being extremely dirty, this approach actually works.
But I need to relink my binary each time I execute it, since the buffer may (and will) be allocated at a different address. To solve this, I think the best way would be to build my executable as a real position-independent executable, so that I can just patch the global offset table rather than fully relink the module.
But with the options provided there, I get RIP-relative addressing but no GOT/PLT. So I'd like to find a way to build my module as a proper PIE.
This post is getting huge and messy, and we are deviating from the original question, so I opened a new, simplified post there. If I get interesting answers, I'll edit this post to explain them.
Note: For the sake of simplicity, safety checks are not shown here.
Note 2: I am perfectly aware that my PoC is very unusual and can be bad practice, but I'd like to do it anyway.
I have an f77 unformatted binary file.
I know that the file contains 2 floats and a long integer as well as data.
The size of the file is 536870940 bytes, which should include the 512^3 float data values together with the 2 floats and the long integer.
The 512^3 float data values make up 536870912 bytes, leaving a further 28 bytes.
My problem is that I need to work out where the 28 bytes begin and how to skip this amount of storage so that I can directly access the data.
I prefer to use C to access the file.
Unfortunately, there is no standard for what unformatted means. But some methods are more common than others.
In many Fortran versions I have used, every write command writes a header (often an unsigned 32-bit int) saying how many bytes the data is, then the data, then repeats the header value in case you're reading from the rear.
From the values you have provided, it might be that you have something like this:
uint32(record1 header), probably 12.
float32, float32, int32 (the three 'other values' you talked about)
uint32(record1 header, same as first value)
uint32(record2 header, probably 512^3*4)
float32*512^3
uint32(record2 header, same as before)
You might have to check endianness.
So I suggest you open the file in a hexdump program, and check whether bytes 0-3 are identical to bytes 16-19, and whether bytes 20-23 are repeated at the end of the data again.
If that is the case, I'd then check the endianness to see whether the values are little or big endian, and with a little luck you'll have your data.
Note: I assume that these three other values are metadata about the data, and therefore would be at the beginning of the file. If that's not the case, you might have them at the end.
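If it helps, here is a small C sketch of the check suggested above (my own illustration; "data.dat" is a hypothetical file name, and a little-endian host is assumed): read the first 24 bytes and compare the record markers.
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint8_t b[24];
    FILE *f = fopen("data.dat", "rb");   /* hypothetical file name */
    if (!f || fread(b, 1, 24, f) != 24) return 1;

    uint32_t pre1, post1, pre2;
    memcpy(&pre1,  b,      4);   /* record 1 length prefix */
    memcpy(&post1, b + 16, 4);   /* record 1 length suffix */
    memcpy(&pre2,  b + 20, 4);   /* record 2 length prefix */

    printf("record 1 marker: %u == %u ?\n", pre1, post1);  /* expect 12 */
    printf("record 2 marker: %u\n", pre2);  /* expect 512^3*4 = 536870912 */
    fclose(f);
    return 0;
}
(If the printed values look absurd, the file is probably the other endianness.)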
Update:
In your comment, you write that your data begins with something like this:
0C 00 00 00 XX XX XX XX XX XX XX XX XX XX XX XX 0C 00 00 00
^- header-^ ^-header -^
E8 09 FF 1F (many, many values) E8 09 FF 1F
^- header-^ ^--- your data ---^ ^-header -^
Now, I don't know how to read data in C; I leave this up to you. What you need to do is skip the first 24 bytes, then read the data as (probably little-endian) 4-byte floating-point values. You will have 4 bytes left at the end that you don't need any more.
Important note:
Fortran stores arrays column-major, C afaik stores them row-major. So keep in mind that the order of the indices will be reversed.
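Here's a minimal C sketch under the assumptions above (24-byte prefix, 4-byte floats, host endianness matching the file; "data.dat" is again a hypothetical name):
#include <stdio.h>
#include <stdlib.h>

#define N 512

int main(void)
{
    FILE *f = fopen("data.dat", "rb");   /* hypothetical file name */
    if (!f) { perror("fopen"); return 1; }

    /* Skip record 1 (4 + 12 + 4 bytes) plus record 2's length prefix. */
    if (fseek(f, 24L, SEEK_SET) != 0) { perror("fseek"); return 1; }

    float *data = malloc((size_t)N * N * N * sizeof *data);
    if (fread(data, sizeof *data, (size_t)N * N * N, f) != (size_t)N * N * N) {
        fprintf(stderr, "short read\n");
        return 1;
    }
    /* Column-major: Fortran element (i,j,k) is data[(k*N + j)*N + i]
       when i, j, k are 0-based. */
    printf("first value: %g\n", data[0]);
    free(data);
    fclose(f);
    return 0;
}
The final 4 bytes of the file (record 2's trailing length marker) are simply never read.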
I know how to read this in Python:
from scipy.io import FortranFile
ff = FortranFile('data.dat', 'r', '<u4')
# read the three values you are not interested in
threevals = ff.read_record('<u4')
# read the data
data = ff.read_record('<f4')
ff.close()
I've divided a matrix into blocks and multiplied it using Fox's algorithm.
How can I print the result matrix to the screen, when it is stored by blocks in different processes, without sending these blocks back to the process with rank 0?
For example.
After multiplication I've got:
Block A:
83 64
112 76
Block B:
118 44
152 34
Block C:
54 68
67 56
Block D:
89 85
114 68
Entire matrix should look like:
83 64 118 44
112 76 152 34
54 68 89 85
67 56 114 68
So far I've managed to send two blocks that contain one row and print it to the screen. But is it possible to print the entire result matrix without sending more than one block to process 0?
// Function for gathering the result matrix
// pCblock   - one block containing part of the entire result matrix
// Size      - matrix dimension
// BlockSize - block dimension
void ResultCollection(double* pCblock, int Size, int BlockSize)
{
    double* pResultRow = new double[Size * BlockSize];
    for (int i = 0; i < BlockSize; i++) {
        MPI_Gather(&pCblock[i * BlockSize], BlockSize, MPI_DOUBLE,
                   &pResultRow[i * Size], BlockSize, MPI_DOUBLE, 0, RowComm);
    }
    // print two matrix rows from two blocks
    delete[] pResultRow;
}
This can't help (Ordering Output in MPI), because for the matrix output I need to print not the entire block A, then B, then C, then D, but rather:
one line from A (in process 0), one line from B (from process 1),
one line from A (in process 0), one line from B (from process 1),
one line from C (from process 2), one line from D (from process 3),
and so on.
Example matrix and blocks
How can I print ... without sending these blocks back to the process with rank 0?
Well, it is time to realise that unless the process with rank 0 were equipped with some sort of clairvoyance, it will never be able to pretty-print any results that were computed remotely in a herd of decentralised, distributed processes.
Similarly, it is easy to test, if you still do not believe what has been published on this, that MPI-distributed code has never been promised any weak or strong guarantee about how the principally uncoordinated delivery of asynchronously remote-printed character streams gets ordered into one common serial output (the system stdout) and finally put onto the screen.
Even if you played a lot with an "addressable ANSI-coded screen", such design efforts will not yield universally working code, and the tricks needed to inject "absolute" addressing into the ANSI-coded output would be awful both to implement and to operate so as to paint a result on the screen correctly.
No. Better not to try either of these ideas.
Your actual MPI-infrastructure advisors / admins will surely help you and show you appropriate tools for smart-collecting the results and post-processing them accordingly.
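That said, if centralised collection turns out to be acceptable after all, a minimal sketch of the conventional pattern (one collective gather of whole blocks to rank 0, then reordering while printing) could look like this. It assumes a q x q process grid in row-major rank order, matching the A/B/C/D layout above - an assumption about the setup, not a drop-in solution:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Gather every BlockSize x BlockSize block to rank 0 and print the
   full Size x Size matrix row by row. Assumes rank r owns block r of
   a row-major q x q block grid. */
void PrintGatheredMatrix(const double *pCblock, int Size, int BlockSize)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int q = Size / BlockSize;            /* blocks per row/column */
    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)Size * Size * sizeof *all);

    /* One collective: the block of rank r lands at all[r * BlockSize * BlockSize]. */
    MPI_Gather(pCblock, BlockSize * BlockSize, MPI_DOUBLE,
               all, BlockSize * BlockSize, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < Size; i++) {
            for (int j = 0; j < Size; j++) {
                int b = (i / BlockSize) * q + (j / BlockSize);
                printf("%6.0f ", all[b * BlockSize * BlockSize
                                     + (i % BlockSize) * BlockSize
                                     + (j % BlockSize)]);
            }
            printf("\n");
        }
        free(all);
    }
}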
I've been reading the description of the xz file format ( http://tukaani.org/xz/xz-file-format.txt ). But when I try to look into an xz file with a binary editor, it doesn't seem to follow the structure defined in the description. What am I missing?
I compressed the description file (xz-file-format.txt) with the xz CLI utility in Linux (xz version 4.999.9beta), and these are the first 32 bytes I get:
FD 37 7A 58 5A 00 00 04 E6 D6 B4 46 02 00 21 01 16 00 00 00 74 2F E5 A3 E0 A9 28 2A 99 5D 00 05
Overall structure of the file should be: stream - stream padding - stream - and so on. And in this case I think there should be only one stream since there is only one file compressed in the file. Structure of the stream is: stream header - block - block - ... - block - index - stream footer. And structure of the stream header is: header magic bytes - stream flags - crc code.
I can find the stream header from my file, but after the first sixteen bytes it doesn't seem to follow the description anymore.
First six bytes above are clearly the magic bytes. Next two bytes are the stream flags. Stream flags indicate that CRC64 is being used, so the CRC code takes next eight bytes. Seventeenth byte (I count from one) should then be the first byte of the first block.
The structure of a block is: block header - compressed data - block padding - check. The structure of the block header should be: block header size - block flags - compressed size - uncompressed size - list of filter flags - header padding - CRC. So the seventeenth byte should then be the block header size (0x16 in my file). That's possible, but the eighteenth byte seems a bit weird. It should be the block flags bit field; in my file it's null - so no flags set, not even the number of filters, which according to the description should be 1-4.
Since bits 6 and 7 of the block flags are also zeros, compressed and uncompressed sizes should not be present in the file, and the next bytes should be the list of filter flags. The structure of the list is: filter ID - size of properties - filter properties. The nineteenth byte should then be the filter ID. This is null in my file, which is not any of the officially defined filter IDs. If it were a custom ID it would take nine bytes, but as I understand the encoding of sizes described in section 1.2 of the description, it can't be, since "All but the last byte of the multibyte representation have the highest (eighth) bit set." - yet in my file the twentieth byte is also null.
So is there something I don't understand or is the file not following the description?
I asked the question a bit hastily and came up with a solution myself. Just in case someone is interested, I'll answer my own question.
I had misunderstood the meaning of the stream flags in the stream header. They don't affect the CRC code in the header (which is always CRC32), just the CRCs in the stream itself (as the name stream flags implies). This means that the CRC in the header is only four bytes long, and thus bytes 13-24 form a valid block header.
In the block header, the block flags field is again a null byte, which I saw as a problem before. According to the description, the number of filters should be between 1 and 4, so I expected a value of at least one. But since the number of filters is expressed with two bits, the four possible values (zero included) map onto the range 1-4, and thus zero means one filter.
Since the last two bits of the block flags are also zeros, no compressed size or uncompressed size fields are present in the block header. This means that bytes 15-17 are the filter flags for the first (and only) filter. Filter ID 0x21 is the ID of the LZMA2 filter. Size of properties 0x01 means a size of one byte. And the dictionary size byte 0x16 encodes an 8 MiB dictionary (the default for xz -6).
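As a small illustration of that parsing (a hedged sketch: the bytes are hard-coded from the hex dump at the top of the question, with offsets as worked out above):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint8_t b[] = {
        0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00,  /* stream header: magic  */
        0x00, 0x04,                          /* stream flags          */
        0xE6, 0xD6, 0xB4, 0x46,              /* CRC32 of stream flags */
        0x02,                                /* block header size     */
        0x00,                                /* block flags           */
        0x21, 0x01, 0x16,                    /* filter flags          */
    };
    printf("check type: %u (4 = CRC64)\n", (unsigned)(b[7] & 0x0F));
    printf("block header: %u bytes\n", ((unsigned)b[12] + 1) * 4);
    printf("number of filters: %u\n", (unsigned)(b[13] & 0x03) + 1);
    printf("filter ID 0x%02X, %u property byte(s), dict byte 0x%02X\n",
           (unsigned)b[14], (unsigned)b[15], (unsigned)b[16]);
    return 0;
}
This prints a check type of 4 (CRC64), a 12-byte block header (bytes 13-24), one filter, and filter ID 0x21 (LZMA2) with one property byte - matching the walkthrough above.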