I have been working for two weeks on JamVM, a small but powerful Java Virtual Machine.
Now I am trying to figure out how the memory is implemented, and I am stuck on two stupid C problems:
char *mem = (char*)mmap(0, args->max_heap, PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON, -1, 0);
--> The -1 parameter stands for a file descriptor; what does that mean? (I have already read the mmap man page, but haven't found it, maybe I misunderstood...).
heapbase = (char*)(((uintptr_t)mem+HEADER_SIZE+OBJECT_GRAIN-1&)~(OBJECT_GRAIN-1)) HEADER_SIZE;
--> What is 1& ? I can't find it in the C specification...
Thanks,
Yann
In answer to your first question, from the man page:
fd should be a valid file descriptor, unless MAP_ANONYMOUS is set. If MAP_ANONYMOUS is set, then fd is ignored on Linux. However, some implementations require fd to be -1 if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable applications should ensure this.
So it's -1 because MAP_ANONYMOUS is being used.
You use the file descriptor when you have an open file that you want to map into memory. In this case, you're creating an anonymous map (one not backed by a file) so the file descriptor isn't needed. Some implementations ignore fd for anonymous maps, some require it to be -1.
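For illustration only (not JamVM code), here is a minimal sketch of an anonymous mapping used as a plain block of memory, which is exactly why no real file descriptor is needed (MAP_ANON is an older synonym for MAP_ANONYMOUS):
#define _DEFAULT_SOURCE       /* for MAP_ANONYMOUS with glibc */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 1 << 20;    /* 1 MiB of anonymous memory */

    /* No file backs this mapping, so fd is -1 and offset is 0. */
    char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    mem[0] = 42;              /* pages are zero-filled and writable */
    munmap(mem, size);
    return EXIT_SUCCESS;
}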
The second question is a syntax error (probably a typo). The line should be something like:
heapbase = (char*)(((uintptr_t)mem+HEADER_SIZE+OBJECT_GRAIN-1)
&~(OBJECT_GRAIN-1)) - HEADER_SIZE;
In that case, OBJECT_GRAIN will be a power of two and it's a way to get alignment to that power. For example, if it were 8, then ~(OBJECT_GRAIN-1) would be ~7 (in binary, ~00...00111, which is 11...11000) which, when ANDed with a value, could be used to force that value to the multiple of 8 less than or equal to it.
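To illustrate the idiom outside JamVM, here is a minimal sketch of a round-up helper built on the same mask trick (align is assumed to be a power of two):
#include <stdint.h>
#include <stdio.h>

/* Round value up to the next multiple of align; align must be a power of two. */
static uintptr_t align_up(uintptr_t value, uintptr_t align)
{
    return (value + align - 1) & ~(align - 1);
}

int main(void)
{
    printf("%lu\n", (unsigned long) align_up(13, 8));   /* prints 16 */
    printf("%lu\n", (unsigned long) align_up(16, 8));   /* prints 16 */
    return 0;
}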
In fact, it's definitely a transcription error somewhere (not necessarily you) because, when I download the JamVM from here and look in src/alloc.c, I get:
void initialiseAlloc(InitArgs *args) {
char *mem = (char*)mmap(0, args->max_heap, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANON, -1, 0);
:
<< a couple of irrelevant lines >>
:
/* Align heapbase so that start of heap + HEADER_SIZE is object aligned */
heapbase = (char*)(((uintptr_t)mem+HEADER_SIZE+OBJECT_GRAIN-1)&
~(OBJECT_GRAIN-1))-HEADER_SIZE;
(note that your version is also missing the - immediately before HEADER_SIZE, something else that points to transcription problems).
I have a PCIe endpoint device connected to the host. The EP's (endpoint's) 512 MB BAR is mmapped and memcpy is used to transfer data. memcpy is quite slow (~2.5 s). When I map only 100 bytes of the BAR instead of all of it, but still run memcpy for the full 512 MB, I get a segfault within 0.5 s; however, when reading back the end of the BAR, it shows the correct data. That is, the data reads the same as if I had mmapped the whole BAR space.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
Code to map the whole BAR (takes 2.5s):
fd = open(filename, O_RDWR | O_SYNC);
map_base = mmap(NULL, 536870912, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
Code to map only 100 bytes (takes 0.5s):
fd = open(filename, O_RDWR | O_SYNC);
map_base = mmap(NULL, 100, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
To check the written data, I am using pcimem
https://github.com/billfarrow/pcimem
Edit: I was being dumb; while consistent data was being 'written' after the segfault, it was not the data that it should have been. Therefore my conclusion that memcpy was completing after the segfault was false. I am accepting the answer as it provided me useful information.
Assuming filename is just an ordinary file (to save the data), leave off O_SYNC. It will just slow things down [possibly, a lot].
When opening the BAR device, consider using O_DIRECT. This may minimize caching effects. That is, if the BAR device does its own caching, eliminate caching by the kernel, if possible.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
The "short" mmap/read is not working. The extra data comes from the prior "full" mapping. So, your test isn't valid.
To ensure consistent results, do unlink on the output file. Do open with O_CREAT. Then, use ftruncate to extend the file to the full size.
Here is some code to try:
#define SIZE (512 * 1024 * 1024)
// remove the output file
unlink(filename);
// open output file (create it)
int ofile_fd = open(filename, O_RDWR | O_CREAT, 0644);
// prevent segfault by providing space in the file
ftruncate(ofile_fd,SIZE);
map_base = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, ofile_fd, 0);
// use O_DIRECT to minimize caching effects when accessing the BAR device
#if 0
int rand_fd = open(infile, O_RDONLY);
#else
int rand_fd = open(infile, O_RDONLY | O_DIRECT);
#endif
rand_base = mmap(0, SIZE, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, SIZE);
if (munmap(map_base, map_size) == -1) {
PRINT_ERROR;
}
// close the output file
close(ofile_fd);
Depending upon the characteristics of the BAR device, to minimize the number of PCIe read/fetch/transaction requests, it may be helpful to ensure that it is being accessed as 32 bit (or 64 bit) elements.
Does the BAR space allow/support/encourage access as "ordinary" memory?
Usually, memcpy is smart enough to switch to "wide" memory access automatically (if memory addresses are aligned, which they are here). That is, memcpy will automatically use 64 bit fetches, with movq or possibly by using some XMM instructions, such as movdqa.
It would help to know exactly which BAR device(s) you have. The datasheet/appnote should give enough information.
UPDATE:
Thanks for the sample code. Unfortunately, aarch64-gcc gives 'O_DIRECT undeclared' for some reason. Without using that flag, the speed is the same as my original code.
Add #define _GNU_SOURCE above any #include to resolve O_DIRECT
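Concretely, the define has to come before the first header is included, e.g.:
#define _GNU_SOURCE      /* must appear before any #include to expose O_DIRECT */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>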
The PCIe device is an FPGA that we are developing. The bitstream is currently the Xilinx DMA example code. The BAR is just 512MB of memory for the system to R/W to.
Serendipitously, my answer was based on my experience with access to the BAR space of a Xilinx FPGA device (it's been a while, circa 2010).
When we were diagnosing speed issues, we used a PCIe bus analyzer. This can show the byte width of the bus requests the CPU has issued. It also shows the turnaround time (e.g. bus read request time until the data packet from the device is returned).
We also had to adjust the parameters in the PCIe config registers (e.g. transfer size, transaction replay) for the device/BAR. This was trial and error, and we (I) tried some 27 different combinations before deciding on the optimum config.
On an unrelated arm system (e.g. nVidia Jetson) about 3 years ago, I had to do memcpy to/from the GPU memory. It may have just been the particular cross-compiler I was using, but the disassembly of memcpy showed that it only used bytewide transfers. That is, it wasn't as smart as its x86 counterpart. I wrote/rewrote a version that used unsigned long long [and/or unsigned __int128] transfers. This sped things up considerably. See below.
So, you may wish to disassemble the generated memcpy code. Either the library function and/or code that it may inline into your function.
Just a thought ... If you're just wanting a bulk transfer, you may wish to have the device driver for the device program the DMA engine on the FPGA. This might be handled more effectively with a custom ioctl call to the device driver that accepts a custom struct describing the desired transfer (vs. read or mmap from userspace).
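As a purely hypothetical sketch of what such an interface could look like (the struct layout, ioctl number, and names below are invented for illustration and are not part of any existing driver):
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical transfer descriptor handed to a custom driver. */
struct fpga_dma_xfer {
    uint64_t src_offset;     /* offset into the source (e.g. system memory) */
    uint64_t dst_offset;     /* offset into the BAR */
    uint64_t length;         /* bytes to transfer */
    uint32_t flags;          /* e.g. direction */
};

/* Hypothetical ioctl number; a real driver would define its own. */
#define FPGA_IOC_DMA _IOW('F', 1, struct fpga_dma_xfer)

/* Usage sketch: ioctl(dev_fd, FPGA_IOC_DMA, &xfer); */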
Are you writing a custom device driver for the device? Or, are you just using some generic device driver?
Here's what I had to do to get a fast memcpy on arm. It generates ldp/stp asm instructions.
// qcpy.c -- fast memcpy
#include <string.h>
#include <stddef.h>
#ifndef OPT_QMEMCPY
#define OPT_QMEMCPY 128
#endif
#ifndef OPT_QCPYIDX
#define OPT_QCPYIDX 1
#endif
// atomic type for qmemcpy
#if OPT_QMEMCPY == 32
typedef unsigned int qmemcpy_t;
#elif OPT_QMEMCPY == 64
typedef unsigned long long qmemcpy_t;
#elif OPT_QMEMCPY == 128
typedef unsigned __int128 qmemcpy_t;
#else
#error qmemcpy.c: unknown/unsupported OPT_QMEMCPY
#endif
typedef qmemcpy_t *qmemcpy_p;
typedef const qmemcpy_t *qmemcpy_pc;
// _qmemcpy -- fast memcpy
// RETURNS: number of bytes transferred
size_t
_qmemcpy(qmemcpy_p dst,qmemcpy_pc src,size_t size)
{
    size_t cnt;
    size_t idx;

    cnt = size / sizeof(qmemcpy_t);
    size = cnt * sizeof(qmemcpy_t);

    if (OPT_QCPYIDX) {
        for (idx = 0; idx < cnt; ++idx)
            dst[idx] = src[idx];
    }
    else {
        for (; cnt > 0; --cnt, ++dst, ++src)
            *dst = *src;
    }

    return size;
}
// qmemcpy -- fast memcpy
void
qmemcpy(void *dst,const void *src,size_t size)
{
    size_t xlen;

    // use fast memcpy for the aligned portion of the size
    if (OPT_QMEMCPY > 0) {
        xlen = _qmemcpy(dst,src,size);
        src = (const char *) src + xlen;    /* avoid arithmetic on void * */
        dst = (char *) dst + xlen;
        size -= xlen;
    }

    // copy remainder with ordinary memcpy
    if (size > 0)
        memcpy(dst,src,size);
}
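A usage sketch against the mappings from the question (map_base, rand_base and SIZE as defined above); on targets without __int128 support, build with -DOPT_QMEMCPY=64:
/* Copy the whole BAR using the wide-transfer helper above instead of memcpy. */
qmemcpy(map_base, rand_base, SIZE);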
UPDATE #2:
Speaking of serendipity, I am using a Jetson Orin. That is very interesting about the byte-wise behavior.
Just a thought ... If you have a Jetson in the same system as the FPGA, you might get DMA action by judicious use of CUDA.
Due to requirements, I cannot use any custom kernel modules so I am trying to do it all in userspace.
That is a harsh mistress to serve ... With custom H/W, it is almost axiomatic that you can have a custom device driver. So, the requirement sounds like a marketing/executive one rather than a technical one. If it's something like not being able to ship a .ko file because you don't know the target kernel version, it is possible to ship the driver as a .o and defer the .ko creation to the install script.
We want to use the DMA engine, but I am hiking up the learning curve on this one. We are using DMA in the FPGA, but I thought that as long as we could write to the address specified in the dtb, that meant the DMA engine was set up and working. Now I'm wondering if I have completely misunderstood that part.
You probably will not get DMA doing that. If you start the memcpy, how does the DMA engine know the transfer length?
You might have better luck using read/write vs mmap to get DMA going, depending upon the driver.
But, if it were me, I'd keep the custom driver option open:
If you have to tweak/modify the BAR config registers on driver/system startup, I can't recall if it's even possible to map the config registers to userspace.
When doing mmap, the device may be treated as the "backing store" for the mapping. That is, there is still an extra layer of kernel buffering [just like there is when mapping an ordinary file]. The device memory is only updated periodically from the kernel [buffer] memory.
A custom driver can set up a [guaranteed] direct mapping, using some trickery that only the kernel/driver has access to.
Historical note:
When I last worked with the Xilinx FPGA (12 years ago), the firmware loader utility (provided by Xilinx in both binary and source form) would read in bytes from the firmware/microcode .xsvf file using (e.g.) fscanf(fi,"%c",&myint) to get the bytes.
This was horrible. I refactored the utility to fix that and the processing of the state machine and reduced the load time from 15 minutes to 45 seconds.
Hopefully, Xilinx has fixed the utility by now.
The function call to mmap64() is as follows:
addr = (unsigned char*) mmap64(NULL, regionSize, PROT_READ|PROT_WRITE, MAP_SHARED, FileDesc, (unsigned long long)regionAddr);
The arguments typically have values as follows:
regionSize = 0x20000;
FileDesc = 27;
regionAddr = 0x332C0000;
Obviously in the code these values are not hard-coded like that, but I just want to show you what the typical values for them will be.
The problem:
mmap64() call works perfectly in Red Hat Linux 6.6, kernel version: 2.6.32-504.16.2.el6.x86_64. It fails in Red Hat Linux 7.2, kernel version: 3.10.0-327.13.1.el7.x86_64.
No difference in code as far as I'm aware of.
The errno returned is "Invalid argument" or errno #22 (EINVAL). Looking at this reference http://linux.die.net/man/3/mmap64, I see 3 possibilities for the EINVAL error:
We don't like addr, length, or offset (e.g., they are too large, or not aligned on a page boundary). -> most likely culprit in my situation.
(since Linux 2.6.12) length was 0. -> Impossible, I checked length (regionSize) value in debug print, it was 0x20000.
flags contained neither MAP_PRIVATE or MAP_SHARED, or contained both of these values. -> Can't be this as you can see from my function call, only MAP_SHARED flag was given as argument.
So I'm stuck at the moment. Not sure how to debug this. This problem is 100% reproducible. Anyone have tips on what could have changed between the two OS versions to cause this?
Transferring comments into an answer (with some comments trimmed as not relevant).
Why not simply use mmap() without the suffix if you're building 64-bit executables? Does that make any difference to your problem?
However, I think your problem is what you call regionAddr. The last argument to mmap64() is called offset in the synopsis, and:
offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE).
Is your value of regionAddr a multiple of the page size? It looks to me like it has too few trailing zeros in the hex (it is a multiple of 512, but not a multiple of 4K or larger).
Note that the question originally had a different value on display for regionAddr — see also the comments below.
regionAddr = 0x858521600;
and
addr = (unsigned char*) mmap64(NULL, regionSize, PROT_READ|PROT_WRITE, MAP_SHARED, FileDesc, (unsigned long long)regionAddr);
With the revised information (that the value in regionAddr is 0x332C0000 or decimal 858521600), it is less than obvious what is going wrong.
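For completeness, if the offset ever were not page aligned, the usual workaround is to map from the page boundary below it and adjust the returned pointer. A sketch, using plain mmap with _FILE_OFFSET_BITS=64 instead of mmap64 and the variable names from the question:
#define _FILE_OFFSET_BITS 64   /* make off_t and mmap 64-bit capable */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map regionSize bytes of FileDesc starting at regionAddr, even when
   regionAddr is not a multiple of the page size. */
static unsigned char *map_region(int FileDesc, off_t regionAddr, size_t regionSize)
{
    long page = sysconf(_SC_PAGE_SIZE);
    off_t aligned = regionAddr & ~((off_t) page - 1);  /* round down to a page boundary */
    size_t slack = (size_t) (regionAddr - aligned);

    void *base = mmap(NULL, regionSize + slack, PROT_READ | PROT_WRITE,
                      MAP_SHARED, FileDesc, aligned);
    return (base == MAP_FAILED) ? NULL : (unsigned char *) base + slack;
}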
I am writing directly to a disk using C code.
First I tried with malloc and found that write did not work (write returned -1):
fd = open("/dev/sdb", O_DIRECT | O_SYNC | O_RDWR);
void *buff = malloc(512);
lseek(fd, 0, SEEK_SET);
write(fd, buff, 512);
Then I changed the second line with this and it worked:
void *buff;
posix_memalign(&buff,512,512);
However, when I changed the lseek offset to 1 (lseek(fd, 1, SEEK_SET);), write failed again.
First, why didn't malloc work?
Then, I know that in my case posix_memalign guarantees that the start address of the buffer is a multiple of 512. But shouldn't memory alignment and write be separate concerns? Why can't I write to any offset that I want?
From the Linux man page for open(2):
The O_DIRECT flag may impose alignment restrictions on the length
and address of user-space buffers and the file offset of I/Os.
And:
Under Linux 2.4, transfer sizes, and the alignment of the user
buffer and the file offset must all be multiples of the logical block
size of the filesystem. Under Linux 2.6, alignment to 512-byte
boundaries suffices.
The meaning of O_DIRECT is to "try to minimize cache effects of the I/O to and from this file", and if I understand it correctly it means that the kernel should copy directly from the user-space buffer, thus perhaps requiring stricter alignment of the data.
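Putting the three alignment requirements together (buffer address, transfer length, and file offset all multiples of the block size), a minimal sketch along the lines of the question's code might look like this, assuming a 512-byte logical block size; note that writing to /dev/sdb overwrites whatever is stored there:
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdb", O_DIRECT | O_SYNC | O_RDWR);
    if (fd == -1)
        return 1;

    void *buff = NULL;
    if (posix_memalign(&buff, 512, 512) != 0) {   /* 512-byte aligned buffer */
        close(fd);
        return 1;
    }
    memset(buff, 0, 512);

    /* The offset must also be block aligned: 0, 512, 1024, ... but not 1. */
    if (lseek(fd, 512, SEEK_SET) == (off_t)-1 || write(fd, buff, 512) != 512) {
        /* inspect errno here */
    }

    free(buff);
    close(fd);
    return 0;
}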
Maybe the documentation doesn't say so, but it's quite possible that writes and reads to/from a block device are required to be aligned and to cover entire blocks to succeed (this would explain why you get failure in your first and last cases but not in the second). If you use Linux, the documentation of open(2) basically says this:
The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. In Linux alignment restrictions vary by file system and kernel version and might be absent entirely. However there is currently no file system-independent interface for an application to discover these restrictions for a given file or file system. Some file systems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).
Your code shows a lack of error handling. Every line in the code contains functions that may fail, and open, lseek and write also report the cause of error in errno. So with some kind of error handling it would be:
fd = open("/dev/sdb", O_DIRECT | O_SYNC | O_RDWR);
if( fd == -1 ) {
    perror("open failed");
    return;
}

void *buff = malloc(512);
if( !buff ) {
    printf("malloc failed");
    return;
}

if( lseek(fd, 0, SEEK_SET) == (off_t)-1 ) {
    perror("lseek failed");
    free(buff);
    return;
}

if( write(fd, buff, 512) == -1 ) {
    perror("write failed");
    free(buff);
    return;
}
In that case you would at least get a more detailed explanation of what goes wrong. In this case I suspect that you get EIO (Input/output error) from the write call.
Note that the above maybe isn't complete error handling, as perror and printf themselves can fail (and you might want to do something about that possibility).
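As an aside, for a raw block device like /dev/sdb you can at least query the logical block size with the Linux-specific BLKSSZGET ioctl and use the result as your alignment; a sketch:
#include <linux/fs.h>      /* BLKSSZGET */
#include <sys/ioctl.h>

/* Return the logical block size of the block device behind fd, or -1 on error. */
static int logical_block_size(int fd)
{
    int bsz = 0;
    if (ioctl(fd, BLKSSZGET, &bsz) == -1)
        return -1;
    return bsz;            /* typically 512 or 4096 */
}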
I have a small example program which simply fopens a file and uses fgets to read it. Using strace, I notice that the first call to fgets runs an mmap system call, and then read system calls are used to actually read the contents of the file. On fclose, the file is munmapped. If I instead open and read the file with open/read directly, this obviously does not occur. I'm curious what the purpose of this mmap is, and what it is accomplishing.
On my Linux 2.6.31 based system, when under heavy virtual memory demand these mmaps will sometimes hang for several seconds, and appear to me to be unnecessary.
The example code:
#include <stdlib.h>
#include <stdio.h>
int main ()
{
    FILE *f;

    if ( NULL == ( f=fopen( "foo.txt","r" )))
    {
        printf ("Fail to open\n");
    }

    char buf[256];
    fgets(buf,256,f);
    fclose(f);
}
And here is the relevant strace output when the above code is run:
open("foo.txt", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb8039000
read(3, "foo\nbar\n\n"..., 4096) = 9
close(3) = 0
munmap(0xb8039000, 4096) = 0
It's not the file that is mmap'ed - in this case mmap is used anonymously (not on a file), probably to allocate memory for the buffer that the subsequent reads will use.
malloc can in fact result in such a call to mmap. Similarly, the munmap corresponds to a call to free.
The mmap is not mapping the file; instead it's allocating memory for the stdio FILE buffering. Normally malloc would not use mmap to service such a small allocation, but it seems glibc's stdio implementation is using mmap directly to get the buffer. This is probably to ensure it's page-aligned (though posix_memalign could achieve the same thing) and/or to make sure closing the file returns the buffer memory to the kernel. I question the usefulness of page-aligning the buffer. Presumably it's for performance, but I can't see any way it would help unless the file offset you're reading from is also page-aligned, and even then it seems like a dubious micro-optimization.
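If that allocation ever matters in your environment, one way to sidestep it (a sketch, relying on glibc allocating the buffer lazily at the first read) is to hand stdio your own buffer with setvbuf before doing any I/O on the stream:
#include <stdio.h>

int main(void)
{
    static char iobuf[BUFSIZ];
    char buf[256];

    FILE *f = fopen("foo.txt", "r");
    if (f == NULL)
        return 1;

    /* Install our own buffer before the first read, so stdio does not
       have to allocate (mmap) one of its own. */
    setvbuf(f, iobuf, _IOFBF, sizeof iobuf);

    if (fgets(buf, sizeof buf, f) != NULL)
        printf("%s", buf);

    fclose(f);
    return 0;
}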
From what I have read, memory-mapping functions are useful when handling large files (the definition of "large" is something I have no idea about), and for large files they are significantly faster than the 'buffered' I/O calls.
In the example that you have posted, I think the file is opened by the open() function and mmap is used for allocating memory or something else.
This can be seen from the signature of the mmap function:
void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);
The second-to-last parameter takes the file descriptor, which should be non-negative, whereas in the strace output it is -1.
The source code of fopen in glibc shows that mmap can actually be used:
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofopen.c;h=965d21cd978f3acb25ca23152993d9cac9f120e3;hb=HEAD#l36
For my bachelor thesis I want to visualize the data remanence of memory and how it persists after rebooting a system.
I had the simple idea to mmap a picture to memory, shut down my computer, wait x seconds, boot the computer and see if the picture is still there.
int mmap_lena(void)
{
    FILE *fd = NULL;
    size_t lena_size;
    void *addr = NULL;

    fd = fopen("lena.png", "r");
    fseek(fd, 0, SEEK_END);
    lena_size = ftell(fd);

    addr = mmap((void *) 0x12345678, (size_t) lena_size, (int) PROT_READ,
                (int) MAP_SHARED, (int) fileno(fd), (off_t) 0);
    fprintf(stdout, "Addr = %p\n", addr);

    munmap((void *) addr, (size_t) lena_size);
    fclose(fd);

    return EXIT_SUCCESS;
}
I omitted checking return values for clarity's sake.
So after the mmap I tried to somehow get at the address again, but I usually end up with a segmentation fault, as to my understanding the memory is protected by my operating system.
int fetch_lena(void)
{
    FILE *fd = NULL;
    FILE *fd_out = NULL;
    size_t lenna_size;
    FILE *addr = (FILE *) 0x12346000;

    fd = fopen("lena.png", "r");
    fd_out = fopen("lena_out.png", "rw");

    fseek(fd, 0, SEEK_END);
    lenna_size = ftell(fd);

    // Segfault
    fwrite((FILE *) addr, (size_t) 1, (size_t) lenna_size, (FILE *) fd_out);

    fclose(fd);
    fclose(fd_out);

    return 0;
}
Please also note that I hard-coded the addresses in this example, so whenever you run mmap_lena the value I use in fetch_lena could be wrong, as the operating system takes the first parameter to mmap only as a hint (on my system it always defaults to 0x12346000 somehow).
If there is any trivial coding error, I am sorry; my C skills have not fully developed.
I would like to know if there is any way to get to the data I want without implementing any malloc hooks or memory allocator hacks.
Thanks in advance,
David
One issue you have is that you are getting back a virtual address, not the physical address where the memory resides. Next time you boot, the mapping probably won't be the same.
This can definitely be done within a kernel module in Linux, but I don't think there is any sort of API in userspace you can use.
If you have permission (and I assume you could be root on this machine if you are rebooting it), then you can peek at /dev/mem to see the actual physical layout. Maybe you should try sampling values, rebooting, and seeing how many of those values persisted.
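A sketch of that sampling idea (it must run as root, and on kernels built with CONFIG_STRICT_DEVMEM the ranges /dev/mem lets you map are restricted, so this may fail for ordinary RAM):
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    off_t phys = 0x100000;            /* physical address to sample (example) */
    size_t len = 4096;

    int fd = open("/dev/mem", O_RDONLY);
    if (fd == -1) {
        perror("open /dev/mem");
        return EXIT_FAILURE;
    }

    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, phys);
    if (p == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return EXIT_FAILURE;
    }

    for (size_t i = 0; i < 16; i++)   /* print a few sample bytes */
        printf("%02x ", p[i]);
    putchar('\n');

    munmap(p, len);
    close(fd);
    return EXIT_SUCCESS;
}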
There is a similar project where a cold boot attack is demonstrated. The source code is available, maybe you can get some inspiration there.
However, AFAIR they read out the memory without loading an OS first and therefore do not have to mess with the OS's memory protection. Maybe you should try this too, to avoid memory being overwritten or cleared by the OS after boot.
(Also check the video on the site, it's pretty impressive ;)
In the question Direct Memory Access in Linux we worked out most of the fundamentals needed to accomplish this. Note, mmap() is not the answer to this for exactly the reasons that were stated by others .. you need a real address, not virtual, which you can only get inside the kernel (or by writing a driver to relay one to userspace).
The simplest method would be to write a character device driver that can be read or written to, with an ioctl to give you a valid start or ending address. Again, if you want pointers on the memory management functions to use in the kernel, see the question that I've linked to .. most of it was worked out in the comments in the first (and accepted) answer.
Your test code looks odd
FILE *addr = (FILE *) 0x12346000;
fwrite((FILE *) fd_out, (size_t) 1,
(size_t) lenna_size, (FILE *) addr);
You can't just cast an integer to a FILE pointer and expect to get something sane.
Did you also switch the first and last argument to fwrite ? The last argument is supposed to be the FILE* to write to.
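For reference, fwrite takes the source pointer first and the FILE * to write to last, so the call would need to be (this still assumes addr points at readable mapped memory, which a hard-coded address does not guarantee):
/* fwrite(ptr, size, nmemb, stream) */
fwrite((const void *) addr, 1, lenna_size, fd_out);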
You probably want as little OS as possible for this purpose; the more software you load, the more chances of overwriting something you want to examine.
DOS might be a good bet; it uses < 640k of memory. If you don't load HIMEM and instead write your own (assembly required) routine to jump into pmode, copy a block of high memory into low memory, then jump back into real mode, you could write a mostly-real-mode program which can dump out the physical ram (minus however much the BIOS, DOS and your app use). It could dump it to a flash disc or something.
Of course the real problem may be that the BIOS clears the memory during POST.
I'm not familiar with Linux, but you'll likely need to write a device driver. Device drivers must have some way to convert virtual memory addresses to physical memory addresses for DMA purposes (DMA controllers only deal with physical memory addresses). You should be able to use those interfaces to deal directly with physical memory.
I don't say it's the lowest-effort option, but for completeness' sake:
You can compile an MMU-less Linux kernel.
Then you can have a blast, as all addresses are real.
Note 1: You might still get errors if you access a few hardware/BIOS-mapped address spaces.
Note 2: You don't access memory using files in this case; you just assign an address to a pointer and read its content.
int* pStart = (int*)0x600000;
int* pEnd = (int*)0x800000;
for(int* p = pStart; p < pEnd; ++p)
{
// do what you want with *p
}