I'm trying to place a structure in a memory region of a particular memory type (write-back). My ultimate goal is to read from that memory location into the L1D cache.
To achieve this, I did the following:
Used the sample code from "MTRR (Memory Type Range Register) control" to declare a memory range as write-back.
Then mmapped that address range through /dev/mem, using the hint from "Accessing uncachable region using mmap and /proc/mtrr".
Declared a pointer to my structure and assigned the mmapped pointer to it.
My code is:
#define MTRR_BASE 0xf80000
#define MTRR_SIZE 0x40000
#define MTRR_TYPE "write-back"
#define KEY_LEN 128
#define ERRSTRING strerror (errno)
static struct CACHE_ENV{
unsigned char in[KEY_LEN];
unsigned char out[KEY_LEN];
}cacheEnv;
static char *mtrr_strings[MTRR_NUM_TYPES] =
{
"uncachable", /* 0 */
"write-combining", /* 1 */
"?", /* 2 */
"?", /* 3 */
"write-through", /* 4 */
"write-protect", /* 5 */
"write-back", /* 6 */
};
int mtrr_mmap(){
int fd;
fd = open("/dev/mem", O_RDWR|O_SYNC);
if (fd == -1) {
printf("\n Error opening /dev/mem");
return 0;
}
unsigned char *addr = (unsigned char*)mmap((void *)MTRR_BASE, MTRR_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0x0);
if (addr == MAP_FAILED) {
printf("\n mmap() failed");
}
struct CACHE_ENV *enc_str =(struct CACHE_ENV *) addr;
printf("size of enc_str is %ld\n", sizeof (*enc_str));
// will read 64 bytes at a time from *enc_str to load struct CACHE_ENV into L1d cache
/**
*/
// unmmap
if (munmap(addr, MTRR_SIZE) == -1) {
printf("\n Unmapping failed");
return 0;
}
printf("\n Success......\n");
return 0;
}
void mtrr_add(){
int fd;
struct mtrr_sentry sentry;
sentry.base = MTRR_BASE;
sentry.size = MTRR_SIZE;
for (sentry.type = 0; sentry.type < MTRR_NUM_TYPES; ++sentry.type){
if (strcmp (MTRR_TYPE, mtrr_strings[sentry.type]) == 0) break;
}
if (sentry.type >= MTRR_NUM_TYPES){
fprintf (stderr, "Illegal type: \"%s\"\n", MTRR_TYPE);
exit (2);
}
if ( ( fd = open ("/proc/mtrr", O_WRONLY, 0) ) == -1 ){
if (errno == ENOENT){
fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
stderr);
exit (3);
}
fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
exit (4);
}
// adding MTRR entry
if (ioctl (fd, MTRRIOC_ADD_ENTRY, &sentry) == -1){
fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING);
exit (5);
}
// call memory map
mtrr_mmap();
fprintf (stderr, "Sleeping for 15 seconds so you can see the new entry\n");
sleep (15);
close (fd);
fputs ("I've just closed /proc/mtrr so now the new entry should be gone\n", stderr);
}
int main (){
mtrr_add();
}
This code compiles and runs. After running it, cat /proc/mtrr shows:
reg00: base=0x080000000 ( 2048MB), size= 2048MB, count=1: uncachable
reg01: base=0x070000000 ( 1792MB), size= 256MB, count=1: uncachable
.
reg06: base=0x000f80000 ( 15MB), size= 256KB, count=1: write-back
The output above shows that a new /proc/mtrr entry of the desired type was created. As my structure is 256 bytes, it should fit into the L1d cache.
However, I'm not sure whether this code does what I'm trying to achieve, because I don't know how to verify that my structure is in the L1d cache. So, my questions are:
Can I use MTRR, as in mtrr_add(), to declare a memory range of a particular type and then place a structure in that memory range?
Is struct CACHE_ENV *enc_str =(struct CACHE_ENV *) addr; the correct way to point to a particular address? I mean, if I assign a value like enc_str->out = "abcd", will it end up at addr?
If you have any suggestions on how to achieve my goal, or if I'm doing anything wrong, please let me know.
I have a file with some data, which is also memory-mapped, so I have both the file descriptor and a pointer to the mapped pages. Mostly the data is only read from the mapping, but occasionally it is also modified.
The modification consists of changing some data within the file (a sort of header update), plus appending some new data (i.e. writing past the current end of the file).
This data structure is accessed from different threads, and to prevent collisions I synchronize access to it (mutex and friends).
During the modification I use both the file mapping and the file descriptor. Headers are updated implicitly by modifying the mapped memory, whereas the new data is written to the file through the appropriate API (WriteFile on Windows, write on POSIX). It is worth noting that the new data and the headers belong to different pages.
Since the modification changes the file size, the memory mapping is re-initialized after every such modification. That is, it is unmapped and then mapped again (with the new size).
I realize that writes to the mapped memory are "asynchronous" with respect to the file system, and that their order is not guaranteed, but I thought there was no problem because I explicitly close the file mapping, which should (IMHO) act as a sort of flushing point.
Now this works without problems on Windows, but on Linux (Android, to be exact) occasionally the mapped data turns-out to be inconsistent temporarily (i.e. the data is OK when retrying). It seems that it doesn't reflect the newly appended data.
Do I have to call some synchronization API to ensure the data is flushed properly? If so, which one should I use: sync, msync, syncfs or something different?
Thanks in advance.
EDIT:
This is a pseudo-code that illustrates the scenario I'm dealing with.
(The real code is more complex of course)
struct CompressedGrid
{
mutex m_Lock;
int m_FileHandle;
void* m_pMappedMemory;
Hdr* get_Hdr() { return /* the mapped memory with some offset*/; }
void SaveGridCell(int idx, const Cell& cCompressed)
{
AutoLock scope(m_Lock);
// Write to mapped memory
get_Hdr()->m_pCellOffset[idx] = /* current end of file */;
// Append the data
lseek64(m_FileHandle, 0, SEEK_END);
write(m_FileHandle, cCompressed.pPtr, cCompressed.nSize);
// re-map
munmap(...);
m_pMappedMemory = mmap(...); // specify the new file size of course
}
bool DecodeGridCell(int idx, Cell& cRaw)
{
AutoLock scope(m_Lock);
uint64_t nOffs = get_Hdr()->m_pCellOffset[idx];
if (!nOffs)
return false; // unavail
const uint8_t* p = (const uint8_t*)m_pMappedMemory + nOffs;
cRaw.DecodeFrom(p); // This is where the problem appears!
return true;
}
};
Use addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, offset) to map the file.
If the size of the file changes, use newaddr = mremap(addr, len, newlen, MREMAP_MAYMOVE) to update the mapping to reflect it. To extend the file, use ftruncate(fd, newlen) before remapping the file.
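As a rough illustration of the ftruncate() plus mremap() sequence, here is a minimal sketch. The names grow_mapping() and grow_mapping_demo() are made up for this example, and it assumes Linux with glibc, since mremap() is not part of POSIX.

```c
#define _GNU_SOURCE              /* mremap() is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: extend the file behind fd to newlen bytes, then
 * resize the existing MAP_SHARED mapping to match. Returns the (possibly
 * relocated) mapping, or MAP_FAILED on error. */
void *grow_mapping(int fd, void *addr, size_t len, size_t newlen)
{
    assert(newlen >= len);
    /* Extend the file first so the new pages are backed by file data. */
    if (ftruncate(fd, (off_t)newlen) == -1)
        return MAP_FAILED;
    /* The kernel may move the mapping; pointers into the old one die. */
    return mremap(addr, len, newlen, MREMAP_MAYMOVE);
}

/* Demonstration: returns 0 if a byte written through the grown mapping
 * is visible to pread() on the same file descriptor. */
int grow_mapping_demo(const char *path)
{
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char c = 0, *map, *grown;
    int ok, fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;
    unlink(path);                      /* the file lives until close(fd) */
    if (ftruncate(fd, (off_t)page) == -1) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    map[0] = 'x';
    grown = grow_mapping(fd, map, page, 2 * page);
    if (grown == MAP_FAILED) {
        munmap(map, page);
        close(fd);
        return -1;
    }
    grown[2 * page - 1] = 'y';         /* write into the appended page */
    ok = grown[0] == 'x'
      && pread(fd, &c, 1, (off_t)(2 * page - 1)) == 1 && c == 'y';
    munmap(grown, 2 * page);
    close(fd);
    return ok ? 0 : -1;
}
```

Calling grow_mapping_demo() with a scratch path such as /tmp/grow.tmp should return 0 on a system where this works. Any pointer derived from the old mapping has to be recomputed from the value grow_mapping() returns.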
You can use mprotect(addr, len, protflags) to change the protection (read/write) on any pages in the mapping (addr must be aligned on a page boundary, and the change applies to whole pages). You can also tell the kernel about your future accesses via madvise() if the mapping is too large to fit in memory at once, but the kernel seems pretty darned good at managing readahead etc. even without it.
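A small sketch of those two calls, in case it helps. The helper names protect_readonly() and advise_sequential() are invented for this example, and MADV_SEQUENTIAL is just one possible hint.

```c
#define _DEFAULT_SOURCE          /* for madvise() on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: make the pages covering [p, p + len) read-only.
 * mprotect() works on whole pages, so the start is rounded down to a
 * page boundary and the length is extended to compensate. */
int protect_readonly(void *p, size_t len)
{
    const uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    const uintptr_t start = (uintptr_t)p & ~(page - 1);
    const size_t span = len + (size_t)((uintptr_t)p - start);

    assert(len > 0);
    return mprotect((void *)start, span, PROT_READ);
}

/* Hypothetical helper: hint that the mapping will be read front to back.
 * map must be page-aligned, e.g. the address returned by mmap(). */
int advise_sequential(void *map, size_t len)
{
    return madvise(map, len, MADV_SEQUENTIAL);
}
```

Both helpers return 0 on success and -1 with errno set on failure, like the underlying calls. After protect_readonly() succeeds, a write to any byte of the affected pages raises SIGSEGV.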
When you make changes to the mapping, use msync(partaddr, partlen, MS_SYNC | MS_INVALIDATE) or msync(partaddr, partlen, MS_ASYNC | MS_INVALIDATE) to ensure the changes in the partlen chars from partaddr forward are visible to other mappings and file readers. If you use MS_SYNC, the call returns only when the update is complete. The MS_ASYNC call tells the kernel to do the update, but won't wait until it is done. If there are no other memory maps of the file, the MS_INVALIDATE does nothing; but if there are, that tells the kernel to ensure the changes are reflected in those too.
In Linux kernels since 2.6.19, MS_ASYNC does nothing, as the kernel tracks the changes properly anyway (no msync() is needed, except possibly before munmap()). I don't know if Android kernels have patches that change that behaviour; I suspect not. It is still a good idea to keep the msync() calls in the code, for portability across POSIXy systems.
mapped data turns-out to be inconsistent temporarily
Well, unless you do use msync(partaddr, partlen, MS_SYNC | MS_INVALIDATE), the kernel will do the update when it sees best.
So, if you need some changes to be visible to file readers before proceeding, use msync(areaptr, arealen, MS_SYNC | MS_INVALIDATE) in the process doing those updates.
If you don't care about the exact moment, use msync(areaptr, arealen, MS_ASYNC | MS_INVALIDATE). It'll be a no-op on current Linux kernels, but it's a good idea to keep them for portability (perhaps commented out, if necessary for performance) and to remind developers about the (lack of) synchronization expectations.
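To make the msync() usage concrete, here is a minimal sketch. sync_range() and sync_range_demo() are hypothetical names, and the rounding to a page boundary is there because msync() requires a page-aligned start address.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: flush the bytes [p, p + len) of a MAP_SHARED file
 * mapping. If wait is nonzero, block until the update completes (MS_SYNC);
 * otherwise only schedule it (MS_ASYNC). */
int sync_range(void *p, size_t len, int wait)
{
    const uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    const uintptr_t start = (uintptr_t)p & ~(page - 1);
    const size_t span = len + (size_t)((uintptr_t)p - start);
    const int mode = wait ? MS_SYNC : MS_ASYNC;

    assert(len > 0);
    return msync((void *)start, span, mode | MS_INVALIDATE);
}

/* Demonstration: store a value through the mapping, flush it, and read it
 * back with pread(). Returns 0 on success. */
int sync_range_demo(const char *path)
{
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);
    long seen = 0, *map;
    int ok, fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd == -1)
        return -1;
    unlink(path);                      /* the file lives until close(fd) */
    if (ftruncate(fd, (off_t)page) == -1) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    map[3] = 12345L;                   /* "header update" via the mapping */
    ok = sync_range(&map[3], sizeof map[3], 1) == 0
      && pread(fd, &seen, sizeof seen, (off_t)(3 * sizeof seen))
             == (ssize_t)sizeof seen
      && seen == 12345L;
    munmap(map, page);
    close(fd);
    return ok ? 0 : -1;
}
```

Passing wait = 0 selects MS_ASYNC, which keeps the call in place for portability at little cost on current Linux kernels.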
As I commented to OP, I cannot observe the synchronization issues on Linux at all. (That does not mean it does not happen on Android, because Android kernels are derivatives of Linux kernels, not exactly the same.)
I do believe the msync() call is not needed on Linux kernels since 2.6.19 at all, as long as the mapping uses flags MAP_SHARED | MAP_NORESERVE, and the underlying file is not opened using the O_DIRECT flag. The reason for this belief is that in this case, both mapping and file accesses should use the exact same page cache pages.
Here are two test programs that can be used to explore this on Linux. First, a single-process test, test-single.c:
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
static inline int read_from(const int fd, void *const to, const size_t len, const off_t offset)
{
char *p = (char *)to;
char *const q = (char *)to + len;
ssize_t n;
if (lseek(fd, offset, SEEK_SET) != offset)
return errno = EIO;
while (p < q) {
n = read(fd, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1)
return errno = EIO;
else
if (errno != EINTR)
return errno;
}
return 0;
}
static inline int write_to(const int fd, const void *const from, const size_t len, const off_t offset)
{
const char *const q = (const char *)from + len;
const char *p = (const char *)from;
ssize_t n;
if (lseek(fd, offset, SEEK_SET) != offset)
return errno = EIO;
while (p < q) {
n = write(fd, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1)
return errno = EIO;
else
if (errno != EINTR)
return errno;
}
return 0;
}
int main(int argc, char *argv[])
{
unsigned long tests, n, merrs = 0, werrs = 0;
size_t page;
long *map, data[2];
int fd;
char dummy;
if (argc != 3) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s FILENAME COUNT\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "This program will test synchronization between a memory map\n");
fprintf(stderr, "and reading/writing the underlying file, COUNT times.\n");
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
if (sscanf(argv[2], " %lu %c", &tests, &dummy) != 1 || tests < 1) {
fprintf(stderr, "%s: Invalid number of tests to run.\n", argv[2]);
return EXIT_FAILURE;
}
/* Create the file. */
page = sysconf(_SC_PAGESIZE);
fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0644);
if (fd == -1) {
fprintf(stderr, "%s: Cannot create file: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
if (ftruncate(fd, page) == -1) {
fprintf(stderr, "%s: Cannot resize file: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
return EXIT_FAILURE;
}
/* Map it. */
map = mmap(NULL, page, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NORESERVE, fd, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "%s: Cannot map file: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
close(fd);
return EXIT_FAILURE;
}
/* Test loop. */
for (n = 0; n < tests; n++) {
/* Update map. */
map[0] = (long)(n + 1);
map[1] = (long)(~n);
/* msync(map, 2 * sizeof map[0], MS_SYNC | MS_INVALIDATE); */
/* Check the file contents. */
if (read_from(fd, data, sizeof data, 0)) {
fprintf(stderr, "read_from() failed: %s.\n", strerror(errno));
munmap(map, page);
unlink(argv[1]);
close(fd);
return EXIT_FAILURE;
}
werrs += (data[0] != (long)(n + 1) || data[1] != (long)(~n));
/* Update data. */
data[0] = (long)(n * 386131);
data[1] = (long)(n * -257);
if (write_to(fd, data, sizeof data, 0)) {
fprintf(stderr, "write_to() failed: %s.\n", strerror(errno));
munmap(map, page);
unlink(argv[1]);
close(fd);
return EXIT_FAILURE;
}
merrs += (map[0] != (long)(n * 386131) || map[1] != (long)(n * -257));
}
munmap(map, page);
unlink(argv[1]);
close(fd);
if (!werrs && !merrs)
printf("No errors detected.\n");
else {
/* Report each kind of error only if it actually occurred. */
if (werrs)
printf("Detected %lu times (%.3f%%) when file contents were incorrect.\n",
werrs, 100.0 * (double)werrs / (double)tests);
if (merrs)
printf("Detected %lu times (%.3f%%) when mapping was incorrect.\n",
merrs, 100.0 * (double)merrs / (double)tests);
}
return EXIT_SUCCESS;
}
Compile and run using e.g.
gcc -Wall -O2 test-single.c -o single
./single temp 1000000
to test a million times whether the mapping and the file contents stay in sync when both accesses are done in the same process. Note that the msync() call is commented out, because on my machine it is not needed: I never see any errors/desynchronization during testing even without it.
The test rate on my machine is about 550,000 tests per second. Note that each test checks both directions, so it includes a read and a write. I just cannot get this to detect any errors. It is written to be quite sensitive to errors, too.
The second test program uses two child processes and a POSIX realtime signal to tell the other process to check the contents. test-multi.c:
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#define NOTIFY_SIGNAL (SIGRTMIN+0)
int mapper_process(const int fd, const size_t len)
{
long value = 1, count[2] = { 0, 0 };
long *data;
siginfo_t info;
sigset_t sigs;
int signum;
if (fd == -1) {
fprintf(stderr, "mapper_process(): Invalid file descriptor.\n");
return EXIT_FAILURE;
}
data = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
if (data == MAP_FAILED) {
fprintf(stderr, "mapper_process(): Cannot map file.\n");
return EXIT_FAILURE;
}
sigemptyset(&sigs);
sigaddset(&sigs, NOTIFY_SIGNAL);
sigaddset(&sigs, SIGINT);
sigaddset(&sigs, SIGHUP);
sigaddset(&sigs, SIGTERM);
while (1) {
/* Wait for the notification. */
signum = sigwaitinfo(&sigs, &info);
if (signum == -1) {
if (errno == EINTR)
continue;
fprintf(stderr, "mapper_process(): sigwaitinfo() failed: %s.\n", strerror(errno));
munmap(data, len);
return EXIT_FAILURE;
}
if (signum != NOTIFY_SIGNAL)
break;
/* A notify signal was received. Check the write counter. */
count[ (data[0] == value) ]++;
/* Update. */
data[0] = value++;
data[1] = -(value++);
/* Synchronize */
/* msync(data, 2 * sizeof (data[0]), MS_SYNC | MS_INVALIDATE); */
/* And let the writer know. */
kill(info.si_pid, NOTIFY_SIGNAL);
}
/* Print statistics. */
printf("mapper_process(): %lu errors out of %lu cycles (%.3f%%)\n",
count[0], count[0] + count[1], 100.0 * (double)count[0] / (double)(count[0] + count[1]));
fflush(stdout);
munmap(data, len);
return EXIT_SUCCESS;
}
static inline int read_from(const int fd, void *const to, const size_t len, const off_t offset)
{
char *p = (char *)to;
char *const q = (char *)to + len;
ssize_t n;
if (lseek(fd, offset, SEEK_SET) != offset)
return errno = EIO;
while (p < q) {
n = read(fd, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1)
return errno = EIO;
else
if (errno != EINTR)
return errno;
}
return 0;
}
static inline int write_to(const int fd, const void *const from, const size_t len, const off_t offset)
{
const char *const q = (const char *)from + len;
const char *p = (const char *)from;
ssize_t n;
if (lseek(fd, offset, SEEK_SET) != offset)
return errno = EIO;
while (p < q) {
n = write(fd, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1)
return errno = EIO;
else
if (errno != EINTR)
return errno;
}
return 0;
}
int writer_process(const int fd, const size_t len, const pid_t other)
{
long data[2] = { 0, 0 }, count[2] = { 0, 0 };
long value = 0;
siginfo_t info;
sigset_t sigs;
int signum;
sigemptyset(&sigs);
sigaddset(&sigs, NOTIFY_SIGNAL);
sigaddset(&sigs, SIGINT);
sigaddset(&sigs, SIGHUP);
sigaddset(&sigs, SIGTERM);
while (1) {
/* Update. */
data[0] = ++value;
data[1] = -(value++);
/* then write the data. */
if (write_to(fd, data, sizeof data, 0)) {
fprintf(stderr, "writer_process(): write_to() failed: %s.\n", strerror(errno));
return EXIT_FAILURE;
}
/* Let the mapper know. */
kill(other, NOTIFY_SIGNAL);
/* Wait for the notification. */
signum = sigwaitinfo(&sigs, &info);
if (signum == -1) {
if (errno == EINTR)
continue;
fprintf(stderr, "writer_process(): sigwaitinfo() failed: %s.\n", strerror(errno));
return EXIT_FAILURE;
}
if (signum != NOTIFY_SIGNAL || info.si_pid != other)
break;
/* Reread the file. */
if (read_from(fd, data, sizeof data, 0)) {
fprintf(stderr, "writer_process(): read_from() failed: %s.\n", strerror(errno));
return EXIT_FAILURE;
}
/* Check the read counter. */
count[ (data[1] == -value) ]++;
}
/* Print statistics. */
printf("writer_process(): %lu errors out of %lu cycles (%.3f%%)\n",
count[0], count[0] + count[1], 100.0 * (double)count[0] / (double)(count[0] + count[1]));
fflush(stdout);
return EXIT_SUCCESS;
}
int main(int argc, char *argv[])
{
struct timespec duration;
double seconds;
pid_t mapper, writer, p;
size_t page;
siginfo_t info;
sigset_t sigs;
int fd, status;
char dummy;
if (argc != 3) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s FILENAME SECONDS\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "This program will test synchronization between a memory map\n");
fprintf(stderr, "and reading/writing the underlying file.\n");
fprintf(stderr, "The test will run for the specified time, or indefinitely\n");
fprintf(stderr, "if SECONDS is zero, but you can also interrupt it with\n");
fprintf(stderr, "Ctrl+C (INT signal).\n");
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
if (sscanf(argv[2], " %lf %c", &seconds, &dummy) != 1) {
fprintf(stderr, "%s: Invalid number of seconds to run.\n", argv[2]);
return EXIT_FAILURE;
}
if (seconds > 0) {
duration.tv_sec = (time_t)seconds;
duration.tv_nsec = (long)(1000000000 * (seconds - (double)(duration.tv_sec)));
} else {
duration.tv_sec = 0;
duration.tv_nsec = 0;
}
/* Block INT, HUP, CHLD, and the notification signal. */
sigemptyset(&sigs);
sigaddset(&sigs, SIGINT);
sigaddset(&sigs, SIGHUP);
sigaddset(&sigs, SIGCHLD);
sigaddset(&sigs, NOTIFY_SIGNAL);
if (sigprocmask(SIG_BLOCK, &sigs, NULL) == -1) {
fprintf(stderr, "Cannot block the necessary signals: %s.\n", strerror(errno));
return EXIT_FAILURE;
}
/* Create the file. */
page = sysconf(_SC_PAGESIZE);
fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0644);
if (fd == -1) {
fprintf(stderr, "%s: Cannot create file: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
if (ftruncate(fd, page) == -1) {
fprintf(stderr, "%s: Cannot resize file: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
return EXIT_FAILURE;
}
close(fd);
fd = -1;
/* Ensure streams are flushed before forking. They should be, we're just paranoid here. */
fflush(stdout);
fflush(stderr);
/* Fork the mapper child process. */
mapper = fork();
if (mapper == -1) {
fprintf(stderr, "Cannot fork mapper child process: %s.\n", strerror(errno));
unlink(argv[1]);
return EXIT_FAILURE;
}
if (!mapper) {
fd = open(argv[1], O_RDWR);
if (fd == -1) {
fprintf(stderr, "mapper_process(): %s: Cannot open file: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
status = mapper_process(fd, page);
close(fd);
return status;
}
/* Fork the writer child process. (mapper contains the PID of the mapper process.) */
writer = fork();
if (writer == -1) {
fprintf(stderr, "Cannot fork writer child process: %s.\n", strerror(errno));
unlink(argv[1]);
kill(mapper, SIGKILL);
return EXIT_FAILURE;
}
if (!writer) {
fd = open(argv[1], O_RDWR);
if (fd == -1) {
fprintf(stderr, "writer_process(): %s: Cannot open file: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
status = writer_process(fd, page, mapper);
close(fd);
return status;
}
/* Wait for a signal. */
if (duration.tv_sec || duration.tv_nsec)
status = sigtimedwait(&sigs, &info, &duration);
else
status = sigwaitinfo(&sigs, &info);
/* Whatever it was, we kill the child processes. */
kill(mapper, SIGHUP);
kill(writer, SIGHUP);
do {
p = waitpid(-1, NULL, 0);
} while (p != -1 || errno == EINTR);
/* Cleanup. */
unlink(argv[1]);
printf("Done.\n");
return EXIT_SUCCESS;
}
Note that the child processes open the temporary file separately. To compile and run, use e.g.
gcc -Wall -O2 test-multi.c -o multi
./multi temp 10
The second parameter is the duration of the test, in seconds. (You can interrupt the testing safely using SIGINT (Ctrl+C) or SIGHUP.)
On my machine, the test rate is roughly 120,000 tests per second; the msync() call is commented out here also, because I don't ever see any errors/desynchronization even without it. (Plus, msync(ptr, len, MS_SYNC) and msync(ptr, len, MS_SYNC | MS_INVALIDATE) are horribly slow; with either, I get fewer than 1000 tests per second, with absolutely no difference in the results. That's more than a 100x slowdown.)
The MAP_NORESERVE flag tells mmap not to reserve swap space for the mapping; under memory pressure, a MAP_SHARED file mapping is written back to the file itself rather than to swap. If you compile the code on a system that does not recognize that flag, you can omit it. As long as the mapping is not evicted from RAM, the flag does not affect the operation at all.
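One way to omit the flag without editing every mmap() call is a fallback definition. This is only a sketch: treating a missing MAP_NORESERVE as 0 is my assumption of a reasonable no-op, and map_shared() is a made-up wrapper name.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: where <sys/mman.h> does not define MAP_NORESERVE,
 * treating it as 0 turns the flag into a no-op. */
#ifndef MAP_NORESERVE
#define MAP_NORESERVE 0
#endif

/* Hypothetical wrapper using the same flags as the test programs.
 * Returns the mapping, or MAP_FAILED on error. */
void *map_shared(int fd, size_t len)
{
    assert(len > 0);
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NORESERVE, fd, 0);
}
```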
When I run the binary built from this code, it fails with a "Segmentation fault (core dumped)" error. The dmesg output is:
segfault at 0 ip b7651747 sp bfb312d0 error 4 in libc-2.21.so[b75e9000+1b4000]
The code translates a virtual address to a physical address.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
// ORIG_BUFFER will be placed in memory and will then be changed to NEW_BUFFER
// They must be the same length
#define ORIG_BUFFER "Hello, World!"
#define NEW_BUFFER "Hello, Linux!"
// The page frame shifted left by PAGE_SHIFT will give us the physical address of the frame
// Note that this number is architecture dependent. For me on x86_64 with 4096 page sizes,
// it is defined as 12. If you're running something different, check the kernel source
// for what it is defined as.
#define PAGE_SHIFT 12
#define PAGEMAP_LENGTH 8
void* create_buffer(void);
unsigned long get_page_frame_number_of_address(void *addr);
int open_memory(void);
void seek_memory(int fd, unsigned long offset);
int main(void) {
// Create a buffer with some data in it
void *buffer = create_buffer();
// Get the page frame the buffer is on
unsigned int page_frame_number = get_page_frame_number_of_address(buffer);
printf("Page frame: 0x%x\n", page_frame_number);
// Find the difference from the buffer to the page boundary
unsigned int distance_from_page_boundary = (unsigned long)buffer %
getpagesize();
// Determine how far to seek into memory to find the buffer
uint64_t offset = (page_frame_number << PAGE_SHIFT) + distance_from_page_boundary;
// Open /dev/mem, seek the calculated offset, and
// map it into memory so we can manipulate it
// CONFIG_STRICT_DEVMEM must be disabled for this
int mem_fd = open_memory();
seek_memory(mem_fd, offset);
printf("Buffer: %s\n", buffer);
puts("Changing buffer through /dev/mem...");
// Change the contents of the buffer by writing into /dev/mem
// Note that since the strings are the same length, there's no purpose in
// copying the NUL terminator again
if(write(mem_fd, NEW_BUFFER, strlen(NEW_BUFFER)) == -1) {
fprintf(stderr, "Write failed: %s\n", strerror(errno));
}
printf("Buffer: %s\n", buffer);
// Clean up
free(buffer);
close(mem_fd);
return 0;
}
void* create_buffer(void) {
size_t buf_size = strlen(ORIG_BUFFER) + 1;
// Allocate some memory to manipulate
void *buffer = malloc(buf_size);
if(buffer == NULL) {
fprintf(stderr, "Failed to allocate memory for buffer\n");
exit(1);
}
// Lock the page in memory
// Do this before writing data to the buffer so that any copy-on-write
// mechanisms will give us our own page locked in memory
if(mlock(buffer, buf_size) == -1) {
fprintf(stderr, "Failed to lock page in memory: %s\n", strerror(errno));
exit(1);
}
// Add some data to the memory
strncpy(buffer, ORIG_BUFFER, strlen(ORIG_BUFFER));
return buffer;
}
unsigned long get_page_frame_number_of_address(void *addr) {
// Open the pagemap file for the current process
FILE *pagemap = fopen("/proc/self/pagemap", "rb");
// Seek to the page that the buffer is on
unsigned long offset = (unsigned long)addr / getpagesize() * PAGEMAP_LENGTH;
if(fseek(pagemap, (unsigned long)offset, SEEK_SET) != 0) {
fprintf(stderr, "Failed to seek pagemap to proper location\n");
exit(1);
}
// The page frame number is in bits 0-54 so read the first 7 bytes and clear the 55th bit
unsigned long page_frame_number = 0;
fread(&page_frame_number, 1, PAGEMAP_LENGTH-1, pagemap);
page_frame_number &= 0x7FFFFFFFFFFFFF;
fclose(pagemap);
return page_frame_number;
}
int open_memory(void) {
// Open the memory (must be root for this)
int fd = open("/dev/mem", O_RDWR);
if(fd == -1) {
fprintf(stderr, "Error opening /dev/mem: %s\n", strerror(errno));
exit(1);
}
return fd;
}
void seek_memory(int fd, unsigned long offset) {
unsigned pos = lseek(fd, offset, SEEK_SET);
if(pos == -1) {
fprintf(stderr, "Failed to seek /dev/mem: %s\n", strerror(errno));
exit(1);
}
}
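The problem is in the function get_page_frame_number_of_address. The code never checks that the file was opened successfully:
FILE *pagemap = fopen("/proc/self/pagemap", "rb");
Check whether pagemap is NULL before using it. If fopen() failed, the fseek() call that follows is handed a NULL pointer, which would match the "segfault at 0" inside libc in your dmesg output.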
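A possible shape for the checked version is sketched below. The function name page_frame_number_of() and its return convention are my own, not from the question; it also reads the whole 8-byte entry into a uint64_t rather than 7 bytes into an unsigned long.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical variant of get_page_frame_number_of_address() that reports
 * failures instead of passing a NULL FILE* to fseek().
 * Returns 0 and stores the entry's page frame number in *pfn on success,
 * -1 on failure. */
int page_frame_number_of(const void *addr, uint64_t *pfn)
{
    uint64_t entry = 0;
    long offset;
    FILE *pagemap = fopen("/proc/self/pagemap", "rb");

    if (pagemap == NULL) {
        fprintf(stderr, "Cannot open /proc/self/pagemap: %s\n",
                strerror(errno));
        return -1;
    }
    /* The file holds one 64-bit entry per virtual page. */
    offset = (long)((uintptr_t)addr / (uintptr_t)sysconf(_SC_PAGESIZE)
                    * sizeof entry);
    if (fseek(pagemap, offset, SEEK_SET) != 0 ||
        fread(&entry, sizeof entry, 1, pagemap) != 1) {
        fprintf(stderr, "Cannot read pagemap entry\n");
        fclose(pagemap);
        return -1;
    }
    fclose(pagemap);
    /* Bits 0-54 hold the page frame number when the page is present. */
    *pfn = entry & ((UINT64_C(1) << 55) - 1);
    return 0;
}
```

Note that on recent kernels an unprivileged process may get a page frame number of 0 here, so run it as root if you need the real value.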
Background: I am writing MPI versions of I/O system calls, which are based on the collfs project.
The code runs without error on multiple processors on a single node.
However, running on multiple nodes causes a segmentation fault... The error message with 2 processes, 1 process per node is the following:
$ qsub test.sub
$ cat test.e291810
0: pasc_open(./libSDL.so, 0, 0)
1: pasc_open(./libSDL.so, 0, 0)
1: mptr[0]=0 mptr[len-1]=0
1: MPI_Bcast(mptr=eed11000, len=435104, MPI_BYTE, 0, MPI_COMM_WORLD)
0: mptr[0]=127 mptr[len-1]=0
0: MPI_Bcast(mptr=eeb11000, len=435104, MPI_BYTE, 0, MPI_COMM_WORLD)
_pmiu_daemon(SIGCHLD): [NID 00632] [c3-0c0s14n0] [Sun May 18 13:10:30 2014] PE RANK 0 exit signal Segmentation fault
[NID 00632] 2014-05-18 13:10:30 Apid 8283706: initiated application termination
The function where the error occurs is the following:
static int nextfd = BASE_FD;
#define next_fd() (nextfd++)
int pasc_open(const char *pathname, int flags, mode_t mode)
{
int rank;
int err;
if(!init)
return ((pasc_open_fp) def.open)(pathname, flags, mode);
if(MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS)
return -1;
dprintf("%d: %s(%s, %x, %x)\n", rank, __FUNCTION__, pathname, flags, mode);
/* Handle just read-only access for now. */
if(flags == O_RDONLY || flags == (O_RDONLY | O_CLOEXEC)) {
int fd, len, xlen, mptr_is_null;
void *mptr;
struct mpi_buf { int len, en; } buf;
struct file_entry *file;
if(rank == 0) {
len = -1;
fd = ((pasc_open_fp) def.open)(pathname, flags, mode);
/* Call stat to get file size and check for errors */
if(fd >= 0) {
struct stat st;
if(fstat(fd, &st) >= 0)
len = st.st_size;
else
((pasc_close_fp) def.close)(fd);
}
/* Record them */
buf.len = len;
buf.en = errno;
}
/* Propagate file size and errno */
if(MPI_Bcast(&buf, 2, MPI_INT, 0, MPI_COMM_WORLD) != MPI_SUCCESS)
return -1;
len = buf.len;
if(len < 0) {
dprintf("error opening file, len < 0");
return -1;
}
/* Get the page-aligned size */
xlen = page_extend(len);
/* `mmap` the file into memory */
if(rank == 0) {
mptr = ((pasc_mmap_fp) def.mmap)(0, xlen, PROT_READ, MAP_PRIVATE,
fd, 0);
} else {
fd = next_fd();
mptr = ((pasc_mmap_fp) def.mmap)(0, xlen, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
}
((pasc_lseek_fp) def.lseek)(fd, 0, SEEK_SET);
/* Ensure success on all aux. processes */
if(rank != 0)
mptr_is_null = !mptr;
MPI_Allreduce(MPI_IN_PLACE, &mptr_is_null, 1, MPI_INT, MPI_LAND,
MPI_COMM_WORLD);
if(mptr_is_null) {
if(mptr)
((pasc_munmap_fp) def.munmap)(mptr, xlen);
dprintf("%d: error: mmap/malloc error\n", rank);
return -1;
}
dprintf("%d: mptr[0]=%d mptr[len-1]=%d\n", rank, ((char*)mptr)[0], ((char*)mptr)[len-1]);
/* Propagate file contents */
dprintf("%d: MPI_Bcast(mptr=%x, len=%d, MPI_BYTE, 0, MPI_COMM_WORLD)\n",
rank, mptr, len);
if(MPI_Bcast(mptr, len, MPI_BYTE, 0, MPI_COMM_WORLD) != MPI_SUCCESS)
return -1;
if(rank != 0)
fd = next_fd();
/* Register the file in the linked list */
file = malloc(sizeof(struct file_entry));
file->fd = fd;
file->refcnt = 1;
strncpy(file->fn, pathname, PASC_FNMAX);
file->mptr = mptr;
file->len = len;
file->xlen = xlen;
file->offset = 0;
/* Reverse stack */
file->next = open_files;
open_files = file;
return fd;
}
/* Fall back to independent access */
return ((pasc_open_fp) def.open)(pathname, flags, mode);
}
The error occurs at the final MPI_Bcast call. I am at a loss as to why it is happening: the memory it copies from and to I can dereference just fine.
I am using MPICH on a custom Cray XC30 machine running SUSE Linux x86_64.
Thanks!
EDIT: I have tried replacing the MPI_Bcast call with a MPI_Send/MPI_Recv pair, and the result is the same.
The Cray MPI implementation probably does some magic for performance reasons. Without knowing the internals, much of this answer is a guess.
Communication within a node likely does not use the network stack, relying instead on some sort of shared-memory mechanism. When you try to send an mmapped buffer between nodes over the network stack, something somewhere breaks: the DMA engine (I'm wildly guessing here) cannot handle this case.
You can try to page-lock the mmapped buffer; perhaps mlock will work just fine.
If that fails, copy the data into a malloc'ed buffer and send that instead.
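Both suggestions can be combined in one helper, roughly as below. This is a sketch only: pin_or_copy() is a name I made up, and whether locking the pages actually avoids the crash on the Cray is still a guess.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: prepare a buffer to hand to MPI.
 * First try to pin the mmap()ed pages with mlock(); if that fails
 * (e.g. because of RLIMIT_MEMLOCK), copy the data into a malloc()ed
 * buffer instead. *copied is set to 1 when the caller must free() the
 * result. Returns NULL only if both strategies fail. */
void *pin_or_copy(void *mptr, size_t len, int *copied)
{
    void *heap;

    assert(mptr != NULL && len > 0 && copied != NULL);
    *copied = 0;
    if (mlock(mptr, len) == 0)
        return mptr;                 /* pages are now resident and pinned */

    heap = malloc(len);
    if (heap == NULL)
        return NULL;
    memcpy(heap, mptr, len);
    *copied = 1;
    return heap;
}
```

On rank 0 you would call this on mptr before the final MPI_Bcast and broadcast the returned pointer. If *copied is set, free the buffer once the data is no longer needed.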