Unable to find MBR type


I have this code which is part of a project source.
This code detects which boot loader is in the MBR (GRUB or LILO) and sets a flag accordingly.
Surprisingly, on SLES 10-SP1 (SUSE Linux Enterprise Server) it cannot determine the type.
/dev/sda1 is my swap.
/dev/sda2 holds the whole / filesystem, including the MBR.
The same code works on SLES11 and others.
Here MBR_SIZE is #defined to 0x1be.
int lnxfsGetBootType(int pNumber)
{
int i, retval = -1, ccode;
PartInfo *p = &cpuParts[pNumber];
char buffer[SECTOR_SIZE];
var64 offset = 0;
isdLogFileOut(ZISD_LOG_DEVELOPER,"[lnxGBT]\n");
if (getenv("ZENDEVICE") || gUtilPart == 1) {
offset = p->pOffset; // look at the partition BPB
}
//Now try to find the installed boot loader...
lseek64(p->handle, (var64)offset, SEEK_SET); // either MBR or BPB
ccode = read(p->handle, buffer, SECTOR_SIZE);
for (i=0; i<MBR_SIZE-4;i++) {
if (strncmp(&buffer[i], "LILO", 4) == 0) {
if (offset == 0){
retval = FLAG_LNXFS_LILO;
isdLogFileOut(ZISD_LOG_WARNING,"\tLILO MBR found on %s\n",p->header.deviceName);
} else {
retval = FLAG_LNXFS_LILO; // 10.31.06 _BPB;
isdLogFileOut(ZISD_LOG_WARNING,"\tLILO BPB found on %s\n",p->header.deviceName);
}
}
if (strncmp(&buffer[i], "GRUB", 4) == 0) {
if (offset == 0){
retval = FLAG_LNXFS_GRUB;
isdLogFileOut(ZISD_LOG_WARNING,"\tGRUB MBR found on %s\n",p->header.deviceName);
} else {
retval = FLAG_LNXFS_GRUB; // 10.31.06 _BPB;
isdLogFileOut(ZISD_LOG_WARNING,"\tGRUB BPB found on %s\n",p->header.deviceName);
}
}
}
if (retval == -1) {
isdLogFileOut(ZISD_LOG_WARNING,"\tLILO or GRUB mbr/bpb not found on %s\n",p->header.deviceName);
}
return retval;
} // lnxfsGetBootType
Here PartInfo is a struct describing a partition:
//Data structure used internally by the image engine to store information about the
//partitions. It encapsulates the PartHeader struct, which is used to store partition
//information in image archives
typedef struct _PartInfo
{
PartHeader header;
int handle; //file handle for reading/writing physical device
var32 flags; //Various flags as needed. Defined above.
var64 pOffset; //offset to partition from start of physical device
int deviceNumber; //index into 'devices' where this partition's
// physical device is located
int archIndex; //for restoring only. Index into imgParts of the
// archive partition this physical partition is
// mapped to
int bytesWritten; //track number of sectors written so the device-level
// cache can be flushed
void *info; //partition-type-specific info struct
/* snip */
The testing is being done with different virtual disk images under VMWare. I've confirmed the disks are formatted with MBR and not GPT.

I'm not sure what you mean when you say it doesn't work. If your point is that your code returns -1, could you show us a copy of the MBR? You can use this command to capture it:
sudo dd if=/dev/sda bs=512 count=1 | xxd
You mention that your MBR is on /dev/sda2. That is very unusual indeed. If you mean that that is where the boot code is installed, that's a totally different thing. The MBR is always held on the first sector of the disk (assuming it is a DOS-format MBR).
I suppose it's possible that the problem in some of the failure cases is a seek failure or a short read. I've made some tweaks to add error handling and simplify a bit.
#define MBR_SIZE 0x1be
int lnxfsGetBootType(int pNumber)
{
int retval = -1, ccode;
PartInfo *p = &cpuParts[pNumber];
char buffer[SECTOR_SIZE];
off64_t offset = 0;
void *plilo, *pgrub;
const char *what = "MBR";
isdLogFileOut(ZISD_LOG_DEVELOPER,"[lnxGBT]\n");
if (getenv("ZENDEVICE") || gUtilPart == 1) {
offset = p->pOffset; // look at the partition BPB
what = "BPB";
}
// Now try to find the installed boot loader...
if (lseek64(p->handle, offset, SEEK_SET) == -1) {
isdLogFileOut(ZISD_LOG_ERROR,"\tFailed to seek to %s: %s\n", what, strerror(errno));
return -1;
}
ccode = read(p->handle, buffer, SECTOR_SIZE);
if (ccode != SECTOR_SIZE) {
isdLogFileOut(ZISD_LOG_ERROR,"\tFailed to read BPB/MBR: %s\n",
strerror(errno));
return -1;
}
plilo = memmem(buffer, ccode, "LILO", 4);
pgrub = memmem(buffer, ccode, "GRUB", 4);
if (plilo) {
retval = FLAG_LNXFS_LILO;
if (pgrub && pgrub < plilo) {
retval = FLAG_LNXFS_GRUB; // GRUB signature appears before LILO
}
} else if (pgrub) {
retval = FLAG_LNXFS_GRUB;
}
if (-1 == retval) {
isdLogFileOut(ZISD_LOG_WARNING,"\tLILO or GRUB %s not found on %s\n", what, p->header.deviceName);
} else {
isdLogFileOut(ZISD_LOG_WARNING,"\t%s %s found on %s\n",
(retval == FLAG_LNXFS_GRUB ? "GRUB" : "LILO"),
what, p->header.deviceName);
}
return retval;
} // lnxfsGetBootType
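If memmem() (a GNU extension) is not available, the same scan can be written portably. The following is only a minimal sketch: classify_boot_sector() and the enum are hypothetical names, not part of the project. It searches only the boot code area (the first 0x1be bytes, as in the question) and reports whichever tag appears first. The 0x55 0xAA signature check is an extra sanity test that the original code does not perform.

```c
#include <stddef.h>
#include <string.h>

#define BOOT_CODE_SIZE 0x1be   /* bytes before the partition table */
#define SECTOR_BYTES   512

/* Hypothetical result type; the project uses FLAG_LNXFS_* instead. */
enum boot_tag { BOOT_UNKNOWN = 0, BOOT_LILO, BOOT_GRUB };

/* Classify one boot sector that has already been read into memory. */
enum boot_tag classify_boot_sector(const unsigned char *sector, size_t len)
{
    size_t i;

    if (sector == NULL || len < SECTOR_BYTES)
        return BOOT_UNKNOWN;          /* short read: nothing to trust */

    /* Assumption: a valid boot sector ends with the 0x55 0xAA signature. */
    if (sector[510] != 0x55 || sector[511] != 0xAA)
        return BOOT_UNKNOWN;

    /* Report whichever tag appears first in the boot code area. */
    for (i = 0; i + 4 <= BOOT_CODE_SIZE; i++) {
        if (memcmp(sector + i, "GRUB", 4) == 0)
            return BOOT_GRUB;
        if (memcmp(sector + i, "LILO", 4) == 0)
            return BOOT_LILO;
    }
    return BOOT_UNKNOWN;
}
```

Because it works on a buffer rather than a file descriptor, you can feed it the bytes captured with dd above and check the result without touching the disk again.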


how to delete the last created file and rename the other files in sequence in c programming

there are certain files in a directory
ex: log.0
log.1
log.2
log.3
log.4
I want to delete the last created file and rename the other files in sequence.
Here log.0 is the last created file. I want to delete it and renumber the remaining files in sequence: 0, 1, 2, 3, and so on.
The code I used is as follows:
char buffer[30];
char cmd[30];
FILE *fpipe;
char newLogFileName[30];
char oldLogFileName[30];
int fcount = 0;
int max_files = 5;
int ret;
sprintf(cmd, "ls -rt | head -n 1");
if (0 == (fpipe = (FILE*)popen(commandBuff, "r")))
{
printf("popen failed %s\n", strerror(errno) );
}
ret = fscanf(fpipe, "%s", buffer);
pclose(fpipe);
sprintf(cmd , "rm %s", buffer);
ret = system(cmd);
for ( fcount = max_files - 1; fcount >= 0; --fcount)
{
snprintf(oldLogFileName, sizeof(oldLogFileName),
"log.%d", fcount );
snprintf(newLogFileName, sizeof(newLogFileName),
"log.%d", fcount - 1);
rename(oldLogFileName, newLogFileName);
}
I couldn't get the expected result. Please suggest what I need to change to fix it.

I suggest one of several possible approaches:
Use scandir() (or glob()) to get the current log file names. Sort them into the proper order (either as a scandir() filter, or using qsort()). Rename each file in the array to the previous file name (no need to delete anything, because renaming over an existing file replaces the existing file atomically). Write the new file using the last file name in the array.
This has the downside that there must be at least two log files already. Also, if the user deletes say file 'log.2', then no new 'log.2' will be created, only the existing 'log.0', 'log.1', 'log.3', and 'log.4' rotated.
Use scandir() (or glob()) to get the current log file names. Sort them into the proper order (either as a scandir() filter, or using qsort()). Starting with the second log file name in the array, rename it to 'log.0', the third to 'log.1', and so on, incrementing the number by one for each log file name. This "compacts" the log file list, removing any "holes" in the numbering. The new log file will have the next incremented number.
You can also set a maximum number of log files kept, by checking if the count (of log file names in the array) has reached maximum yet. If not, rename the first log file name in the array to 'log.0', and so on, keeping all the existing log files (but removing any "holes" in the numbering). The new log file will still have the next incremented number.
Use scandir() with a filter that always returns zero (so scandir() will return no files), but only updates the minimum and maximum log file numbers (as global variables). This gives you the log file number range, so you can use a simple loop to rename each log file (that exists) to the previous number.
If the user deletes one or more of the log files, those rename() operations will fail with errno == ENOENT; this is simply ignored. This way, those "holes" will percolate through the log file list as usual, and no compaction will occur.
Use a loop over stat() to find the initial consecutive range of log files. If 'log.0' does not exist, you save the new log file to 'log.0'. If you wish to keep up to fifteen log files, and any of 'log.1' to 'log.14' does not exist, you save the new log file to the first missing name. If all fifteen log files, 'log.0' to 'log.14', exist, you rename 'log.1' to 'log.0', 'log.2' to 'log.1', and so on through to 'log.13' to 'log.12', and 'log.14' to 'log.13', and save the new log file to 'log.14'.
When the number of log files to be kept is reasonably small (say, less than a hundred), this is very efficient. The only downside is that if the user deletes a log file by hand, the next new log file will be saved in that "hole", thus possibly mangling the order.
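As a rough illustration of the rename loop used in the approaches above, here is a minimal sketch. rotate_logs() and its prefix parameter are made-up names for this example. It walks the numbers in ascending order, so each rename() replaces a name that has already been moved away, and it ignores ENOENT so that holes simply percolate down.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: rename PREFIX.(first+1) .. PREFIX.last down by one.
   PREFIX.first is replaced by rename(); missing files are skipped.
   Returns 0 on success, -1 with errno set on a real error. */
int rotate_logs(const char *prefix, int first, int last)
{
    char oldname[512], newname[512];
    int i;

    for (i = first + 1; i <= last; i++) {
        int n1 = snprintf(oldname, sizeof oldname, "%s.%d", prefix, i);
        int n2 = snprintf(newname, sizeof newname, "%s.%d", prefix, i - 1);

        if (n1 < 0 || n2 < 0 ||
            (size_t)n1 >= sizeof oldname || (size_t)n2 >= sizeof newname) {
            errno = ENAMETOOLONG;
            return -1;
        }

        /* A missing source file is not an error; anything else is. */
        if (rename(oldname, newname) == -1 && errno != ENOENT)
            return -1;
    }
    return 0;
}
```

With 'log.0', 'log.1' and 'log.3' present, rotate_logs("log", 0, 3) leaves 'log.0' (the old 'log.1') and 'log.2' (the old 'log.3'); the new log file would then be written to 'log.3'. The fixed-size buffers are a simplification; the dynamically allocated pattern described below avoids that limit.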
In all cases I recommend using a simple dynamically allocated memory pattern to construct the file names using snprintf() safely. You start by allocating a small initial buffer, say 128 bytes:
char *name;
size_t size = 128;
name = malloc(size);
if (!name) {
fprintf(stderr, "Not enough memory!\n");
exit(EXIT_FAILURE);
}
Then, when you wish to generate the path based on LOG_NAME_PATTERN (say, "log.%d", or "/var/log/myapp/log.%d") and int count, you do
while (1) {
int len = snprintf(name, size, LOG_NAME_PATTERN, count);
if (len < 0) {
fprintf(stderr, "Invalid LOG_NAME_PATTERN (%s).\n", LOG_NAME_PATTERN);
return EXIT_FAILURE;
} else
if ((size_t)len >= size) {
/* Resize buffer to anything larger than len */
const size_t new_size = ((size_t)len | 15) + 17;
char *new_name;
if (new_size <= (size_t)len) {
fprintf(stderr, "Path pattern is too long.\n");
exit(EXIT_FAILURE);
}
new_name = realloc(name, new_size);
if (!new_name) {
/* Note: name is still valid! */
fprintf(stderr, "Not enough memory.\n");
exit(EXIT_FAILURE);
}
name = new_name;
size = new_size;
continue;
}
/* Have correct path in name. */
break;
}
The above loop resizes (reallocates) the buffer when necessary.
Also note that if the above has succeeded for some nonnegative count, then you can safely allocate another buffer using
char *newname = malloc(size);
if (!newname) {
fprintf(stderr, "Not enough memory.\n");
exit(EXIT_FAILURE);
}
and then you can safely use (void)snprintf(name, size, LOG_NAME_PATTERN, count) or (void)snprintf(newname, size, LOG_NAME_PATTERN, i) for any int i where i is nonnegative and not greater than count, as you need for the rename() operations.
I prefer to use a preprocessor macro for the log file name pattern. For example,
#ifndef LOG_NAME_PATTERN
#define LOG_NAME_PATTERN "log.%d"
#endif
Then, at compile time, one can use -DLOG_NAME_PATTERN="/var/log/myapp.%d" (in say CFLAGS in the Makefile, or directly as a parameter to GCC or Clang or whatever C compiler you use) to override the above default. (The quotes are necessary; but they can also be added using a couple of helper preprocessor macros.)
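The helper macros mentioned in the parenthesis could look roughly like this. It is a sketch only: LOG_NAME_BASE, STR() and format_log_name() are names invented for this example. The idea is to pass the path unquoted (say -DLOG_NAME_BASE=/var/log/myapp) and let the preprocessor add the quotes. One caveat: any path component that happens to be a predefined macro (for example linux or unix in GNU dialects) would be expanded before it is quoted.

```c
#include <stdio.h>

/* Two-level stringification: STR(x) expands x first, then quotes it. */
#define STR_(x) #x
#define STR(x)  STR_(x)

/* Hypothetical macro, passed unquoted on the command line,
   e.g. -DLOG_NAME_BASE=/var/log/myapp */
#ifndef LOG_NAME_BASE
#define LOG_NAME_BASE log
#endif

/* Adjacent string literals are concatenated: "log" ".%d" becomes "log.%d". */
#define LOG_NAME_PATTERN STR(LOG_NAME_BASE) ".%d"

/* Format the name of log file NUMBER into buf; returns what snprintf() returns. */
int format_log_name(char *buf, size_t size, int number)
{
    return snprintf(buf, size, LOG_NAME_PATTERN, number);
}
```

With the default above, format_log_name(buf, sizeof buf, 3) writes "log.3" into buf.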
In general, applications should probably use the system logging facilities: openlog("myapplication", LOG_NDELAY | LOG_PID, LOG_USER) at the beginning of the process, then syslog(LOG_ERR, "format", ...) to log the errors.
Let's assume there is some reason why that is not a reasonable approach, and explore the other options.
Typically, the application log file is /var/log/application.log or /var/log/application/name.log (with different name parts). The application always uses append mode (via O_APPEND or "a").
Older log files are in /var/log/application.log.# or /var/log/application/name.log.#, with # being a decimal number starting with 1; or, if compressed when rotated, in /var/log/application.log.#.EXT or /var/log/application/name.log.#.EXT, where EXT is gz, bz2, or xz. These older log files are considered archived: read-only, not to be modified or appended to.
In these cases, only the current log file (/var/log/application.log or /var/log/application/name.log) is being appended to. The log file rotator renames this log file temporarily, then usually sends a signal (typically HUP) to the service process so it knows to re-open the log file. It then waits until the service process no longer has the temporarily renamed log file open, before compressing the log file, and then does the log rotation. (Privileged processes and processes running as the same user as the target file can take an exclusive file lease to verify no other process has the file open. If the first archived log file (.log.1) is not compressed, the log rotator does not need to check if it is still open.)
Typically, a sane service or application uses logrotate to rotate its logs, by dropping a logrotate configuration snippet as /etc/logrotate.d/application. The postrotate .. endscript part is a command that tells the application, if it is currently running, to close its current log file, and reopen it. It is usually a command that connects to the application; sending SIGHUP for this is particularly common.
Here is a trivial example program you can treat as such a service:
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdarg.h>
#include <signal.h>
// MT: #include <pthread.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#include <errno.h>
#ifndef LOG_TIME
#define LOG_TIME "%Y-%m-%d %H:%M:%S.^^^ %Z: "
#endif
#ifndef LOG_FILE
#define LOG_FILE "myapp.log"
#endif
#ifndef PID_FILE
#define PID_FILE "myapp.pid"
#endif
// MT: static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile sig_atomic_t log_reopen = 0;
static int log_fd = -1;
int log_print(const char *format, ...)
{
va_list args;
int fd, len, tlen;
/* Don't bother if no format. */
if (!format || !*format)
return 0;
// MT: pthread_mutex_lock(&log_lock);
fd = log_fd;
if (log_reopen) {
log_reopen = 0;
if (fd != -1) {
close(fd);
fd = -1;
}
}
if (fd == -1) {
do {
fd = open(LOG_FILE, O_WRONLY | O_APPEND | O_CREAT, 0666);
} while (fd == -1 && errno == EINTR);
if (fd == -1) {
// MT: pthread_mutex_unlock(&log_lock);
return -1;
}
log_fd = fd;
}
tlen = 0;
#ifdef LOG_TIME
const char *timefmt = LOG_TIME;
do {
if (!timefmt || !*timefmt)
break;
struct timespec now_ts;
struct tm now;
if (clock_gettime(CLOCK_REALTIME, &now_ts) == -1) {
now_ts.tv_sec = time(NULL);
now_ts.tv_nsec = 0L;
}
if (localtime_r(&now_ts.tv_sec, &now) != &now)
break;
char stamp[128];
size_t stamplen = strftime(stamp, sizeof stamp, LOG_TIME, &now);
if (!stamplen)
break;
char *dst = strchr(stamp, '^');
if (dst) {
unsigned long value = now_ts.tv_nsec;
int digits = 1;
while (dst[digits] == '^')
digits++;
while (digits > 9)
dst[--digits] = '0';
switch (digits) {
case 9: dst[8] = '0' + ( value % 10uL);
case 8: dst[7] = '0' + ((value / 10uL) % 10uL);
case 7: dst[6] = '0' + ((value / 100uL) % 10uL);
case 6: dst[5] = '0' + ((value / 1000uL) % 10uL);
case 5: dst[4] = '0' + ((value / 10000uL) % 10uL);
case 4: dst[3] = '0' + ((value / 100000uL) % 10uL);
case 3: dst[2] = '0' + ((value / 1000000uL) % 10uL);
case 2: dst[1] = '0' + ((value / 10000000uL) % 10uL);
case 1: dst[0] = '0' + ((value / 100000000uL) % 10uL);
}
}
const char *const end = stamp + stamplen;
const char *src = stamp;
while (src < end) {
ssize_t n = write(fd, src, (size_t)(end - src));
if (n > 0) {
src += n;
} else
if (n != -1 || errno != EINTR) {
close(fd);
log_fd = -1;
// MT: pthread_mutex_unlock(&log_lock);
return -1;
}
}
tlen = (int)stamplen;
} while (0);
#endif
va_start(args, format);
len = vdprintf(fd, format, args);
va_end(args);
// MT: pthread_mutex_unlock(&log_lock);
if (len < 0) {
return len;
} else {
return len + tlen;
}
}
static void log_rotate(int signum)
{
log_reopen = 1;
/* Silence warning about unused parameter; generates no code. */
(void)signum;
}
int install_log_rotate_signal(int signum)
{
struct sigaction act;
memset(&act, 0, sizeof act);
sigemptyset(&act.sa_mask);
act.sa_handler = log_rotate;
act.sa_flags = SA_RESTART;
return sigaction(signum, &act, NULL);
}
/*
* The following is only for testing this as a program;
* CTRL-C, SIGINT, SIGQUIT, SIGTERM will terminate it cleanly.
*/
static volatile sig_atomic_t done = 0;
static void handle_done(int signum)
{
done = 1;
(void)signum;
}
static int install_done(int signum)
{
struct sigaction act;
memset(&act, 0, sizeof act);
sigemptyset(&act.sa_mask);
act.sa_handler = handle_done;
act.sa_flags = 0;
return sigaction(signum, &act, NULL);
}
static int create_pidfile(void)
{
const char *path = PID_FILE;
int fd, len;
do {
fd = open(path, O_WRONLY | O_CREAT, 0666);
} while (fd == -1 && errno == EINTR);
if (fd == -1)
return errno;
len = dprintf(fd, "%ld\n", (long)getpid());
if (len < 1) {
unlink(path);
close(fd);
return errno = EIO;
}
if (close(fd) == -1) {
unlink(path);
return errno = EIO;
}
return 0;
}
static void remove_pidfile(void)
{
unlink(PID_FILE);
}
static double dsleep(const double seconds)
{
struct timespec req, rem;
if (seconds > 0.0) {
req.tv_sec = (long)seconds;
req.tv_nsec = (seconds - (double)req.tv_sec) * 1000000000L;
/* Check for rounding error */
if (req.tv_nsec < 0L)
req.tv_nsec = 0L;
else
if (req.tv_nsec > 999999999L)
req.tv_nsec = 999999999L;
} else {
req.tv_sec = 0;
req.tv_nsec = 0L;
}
if (nanosleep(&req, &rem) == -1) {
if (errno == EINTR)
return seconds - (double)rem.tv_sec - (double)rem.tv_nsec / 1000000000.0;
else
return -1.0;
}
return seconds;
}
int main(void)
{
if (install_log_rotate_signal(SIGHUP) == -1 ||
install_done(SIGINT) == -1 ||
install_done(SIGQUIT) == -1 ||
install_done(SIGTERM) == -1) {
fprintf(stderr, "Cannot install signal handlers: %s.\n", strerror(errno));
return EXIT_FAILURE;
}
if (create_pidfile()) {
fprintf(stderr, "Cannot create PID file '%s': %s.\n", PID_FILE, strerror(errno));
return EXIT_FAILURE;
}
while (!done) {
double s = dsleep(5.0);
log_print("Slept %.3f seconds.\n", s);
}
remove_pidfile();
return EXIT_SUCCESS;
}
Above, LOG_FILE is the path to the log file (usually /var/log/myapplication.log or /var/log/myapplication/name.log), PID_FILE is the path to the PID file (usually /var/run/myapplication.pid), and LOG_TIME is the timestamp prepended to each log event, in strftime() format and in local time, except that consecutive ^ characters are replaced with the fractional seconds. (So, "%S.^^^" yields seconds using three decimals, and "%s.^^^^^^^^^" yields the Unix epoch time at nanosecond precision.)
The // MT: lines are needed if the application is multithreaded.
If you change PID_FILE to /var/run/myapp.pid and LOG_FILE to /var/log/myapp/myapp.log, and make /var/log/myapp/ writable to the user running the above program, you can use logrotate with the following snippet (/etc/logrotate.d/myapp) to rotate its log files daily, keeping up to 15 files, compressing archived log files:
/var/log/myapp/myapp.log {
rotate 15
daily
missingok
notifempty
compress
delaycompress
postrotate
kill -HUP `cat /var/run/myapp.pid` >/dev/null 2>&1 || true
endscript
}

Why can I not mmap /proc/self/maps?

To be specific: why can I do this:
FILE *fp = fopen("/proc/self/maps", "r");
char buf[513]; buf[512] = NULL;
while(fgets(buf, 512, fp) > NULL) printf("%s", buf);
but not this:
int fd = open("/proc/self/maps", O_RDONLY);
struct stat s;
fstat(fd, &s); // st_size = 0 -> why?
char *file = mmap(0, s.st_size /*or any fixed size*/, PROT_READ, MAP_PRIVATE, fd, 0); // gives EINVAL for st_size (because 0) and ENODEV for any fixed block
write(1, file, st_size);
I know that /proc files are not really files, but it seems to have some defined size and content for the FILE* version. Is it secretly generating it on-the-fly for read or something? What am I missing here?
EDIT:
as I can clearly read() from them, is there any way to get the possible available bytes? or am I stuck to read until EOF?

They are created on the fly as you read them. Maybe this would help, it is a tutorial showing how a proc file can be implemented:
https://devarea.com/linux-kernel-development-creating-a-proc-file-and-interfacing-with-user-space/
tl;dr: you give it a name and read and write handlers, that's it. Proc files are meant to be very simple to implement from the kernel dev's point of view. They do not behave like full-featured files though.
As for the bonus question, there doesn't seem to be a way to indicate the size of the file, only EOF on reading.

proc "files" are not really files; they are just streams that can be read from or written to, and they contain no physical data in memory that you can map.
https://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html

As already explained by others, /proc and /sys are pseudo-filesystems, consisting of data provided by the kernel, that does not really exist until it is read – the kernel generates the data then and there. Since the size varies, and really is unknown until the file is opened for reading, it is not provided to userspace at all.
It is not "unfortunate", however. The same situation occurs very often, for example with character devices (under /dev), pipes, FIFOs (named pipes), and sockets.
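To see the mismatch from the question directly, you can compare what fstat() reports with what read() actually delivers. This is only a small sketch; reported_size() and bytes_until_eof() are names made up for the example, and for brevity it does not retry on EINTR.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Size the kernel reports via fstat(), or -1 on error. */
long long reported_size(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return -1;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long long)st.st_size;
}

/* Bytes actually delivered by read() before EOF, or -1 on error. */
long long bytes_until_eof(const char *path)
{
    char chunk[4096];
    long long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return -1;
    while ((n = read(fd, chunk, sizeof chunk)) > 0)
        total += n;
    close(fd);
    return (n == 0) ? total : -1;
}
```

On a Linux system with /proc mounted, reported_size("/proc/self/maps") gives 0 while bytes_until_eof("/proc/self/maps") gives a positive count, which is why reading until EOF is the only reliable way to get the length.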
We can trivially write a helper function to read pseudofiles completely, using dynamic memory management. For example:
// SPDX-License-Identifier: CC0-1.0
//
#define _POSIX_C_SOURCE 200809L
#define _ATFILE_SOURCE
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
/* For example main() */
#include <stdio.h>
/* Return a directory handle for a specific relative directory.
For absolute paths and paths relative to current directory, use dirfd==AT_FDCWD.
*/
int at_dir(const int dirfd, const char *dirpath)
{
if (dirfd == -1 || !dirpath || !*dirpath) {
errno = EINVAL;
return -1;
}
return openat(dirfd, dirpath, O_DIRECTORY | O_PATH | O_CLOEXEC);
}
/* Read the (pseudofile) contents to a dynamically allocated buffer.
For absolute paths and paths relative to current directory, use dirfd==AT_FDCWD.
You can safely initialize *dataptr=NULL,*sizeptr=0 for dynamic allocation,
or reuse the buffer from a previous call or e.g. getline().
Returns 0 with errno set if an error occurs. If the file is empty, errno==0.
In all cases, remember to free (*dataptr) after it is no longer needed.
*/
size_t read_pseudofile_at(const int dirfd, const char *path, char **dataptr, size_t *sizeptr)
{
char *data;
size_t size, have = 0;
ssize_t n;
int desc;
if (!path || !*path || !dataptr || !sizeptr) {
errno = EINVAL;
return 0;
}
/* Existing dynamic buffer, or a new buffer? */
size = *sizeptr;
if (!size)
*dataptr = NULL;
data = *dataptr;
/* Open pseudofile. */
desc = openat(dirfd, path, O_RDONLY | O_CLOEXEC | O_NOCTTY);
if (desc == -1) {
/* errno set by openat(). */
return 0;
}
while (1) {
/* Need to resize buffer? */
if (have >= size) {
/* For pseudofiles, linear size growth makes most sense. */
size = (have | 4095) + 4097 - 32;
data = realloc(data, size);
if (!data) {
close(desc);
errno = ENOMEM;
return 0;
}
*dataptr = data;
*sizeptr = size;
}
n = read(desc, data + have, size - have);
if (n > 0) {
have += n;
} else
if (n == 0) {
break;
} else
if (n == -1) {
const int saved_errno = errno;
close(desc);
errno = saved_errno;
return 0;
} else {
close(desc);
errno = EIO;
return 0;
}
}
if (close(desc) == -1) {
/* errno set by close(). */
return 0;
}
/* Append zeroes - we know size > have at this point.
   Clear at most 32 bytes, and never write past the end of the buffer. */
if (have + 32 < size)
memset(data + have, 0, 32);
else
memset(data + have, 0, size - have);
errno = 0;
return have;
}
int main(void)
{
char *data = NULL;
size_t size = 0;
size_t len;
int selfdir;
selfdir = at_dir(AT_FDCWD, "/proc/self/");
if (selfdir == -1) {
fprintf(stderr, "/proc/self/ is not available: %s.\n", strerror(errno));
exit(EXIT_FAILURE);
}
len = read_pseudofile_at(selfdir, "status", &data, &size);
if (errno) {
fprintf(stderr, "/proc/self/status: %s.\n", strerror(errno));
exit(EXIT_FAILURE);
}
printf("/proc/self/status: %zu bytes\n%s\n", len, data);
len = read_pseudofile_at(selfdir, "maps", &data, &size);
if (errno) {
fprintf(stderr, "/proc/self/maps: %s.\n", strerror(errno));
exit(EXIT_FAILURE);
}
printf("/proc/self/maps: %zu bytes\n%s\n", len, data);
close(selfdir);
free(data); data = NULL; size = 0;
return EXIT_SUCCESS;
}
The above example program opens a directory descriptor ("atfile handle") to /proc/self. (This way you do not need to concatenate strings to construct paths.)
It then reads the contents of /proc/self/status. If successful, it displays its size (in bytes) and its contents.
Next, it reads the contents of /proc/self/maps, reusing the previous buffer. If successful, it displays its size and contents as well.
Finally, the directory descriptor is closed as it is no longer needed, and the dynamically allocated buffer released.
Note that it is perfectly safe to do free(NULL), and also to discard the dynamic buffer (free(data); data=NULL; size=0;) between the read_pseudofile_at() calls.
Because pseudofiles are typically small, read_pseudofile_at() uses a linear dynamic buffer growth policy. If there is no previous buffer, it starts with 8160 bytes, and grows it by 4096 bytes at a time until it is sufficiently large. Feel free to replace it with whatever growth policy you prefer; this one is just an example, but it works quite well in practice without wasting much memory.

Pinning user space buffer for DMA from Linux kernel

I'm writing a driver for devices that produce around 1GB of data per second. Because of that I decided to map the user buffer allocated by the application directly for DMA instead of copying through an intermediate kernel buffer.
The code works, more or less. But during long-run stress testing I see kernel oops with "bad page state" initiated by unrelated applications (for instance updatedb), probably when kernel wants to swap some pages:
[21743.515404] BUG: Bad page state in process PmStabilityTest pfn:357518
[21743.521992] page:ffffdf844d5d4600 count:19792158 mapcount:0 mapping: (null) index:0x12b011e012d0132
[21743.531829] flags: 0x119012c01220124(referenced|lru|slab|reclaim|uncached|idle)
[21743.539138] raw: 0119012c01220124 0000000000000000 012b011e012d0132 012e011e011e0111
[21743.546899] raw: 0000000000000000 012101300131011c 0000000000000000 012101240123012b
[21743.554638] page dumped because: page still charged to cgroup
[21743.560383] page->mem_cgroup:012101240123012b
[21743.564745] bad because of flags: 0x120(lru|slab)
[21743.569555] BUG: Bad page state in process PmStabilityTest pfn:357519
[21743.576098] page:ffffdf844d5d4640 count:18219302 mapcount:18940179 mapping: (null) index:0x0
[21743.585318] flags: 0x0()
[21743.587859] raw: 0000000000000000 0000000000000000 0000000000000000 0116012601210112
[21743.595599] raw: 0000000000000000 011301310127012f 0000000000000000 012f011d010d011a
[21743.603336] page dumped because: page still charged to cgroup
[21743.609108] page->mem_cgroup:012f011d010d011a
...
Entering kdb (current=0xffff8948189b2d00, pid 6387) on processor 6 Oops: (null)
due to oops # 0xffffffff9c87f469
CPU: 6 PID: 6387 Comm: updatedb.mlocat Tainted: G B OE 4.10.0-42-generic #46~16.04.1-Ubuntu
...
Details:
The user buffer consists of frames, and neither the buffer nor the frames are page-aligned. The frames in the buffer are used in a circular manner for "infinite" live data transfers. For each frame I get memory pages via get_user_pages_fast, then convert them to a scatter-gather table with sg_alloc_table_from_pages, and finally map it for DMA using dma_map_sg.
I rely on sg_alloc_table_from_pages to bind consecutive pages into one DMA descriptor to reduce the size of the S/G table sent to the device. The devices are custom built and use an FPGA. I took inspiration from many drivers doing similar mapping, especially the video drivers i915 and radeon, but none of them has everything in one place, so I might have overlooked something.
Related functions (pin_user_buffer and unpin_user_buffer are called upon separate IOCTLs):
static int pin_user_frame(struct my_dev *cam, struct udma_frame *frame)
{
const unsigned long bytes = cam->acq_frame_bytes;
const unsigned long first =
( frame->uaddr & PAGE_MASK) >> PAGE_SHIFT;
const unsigned long last =
((frame->uaddr + bytes - 1) & PAGE_MASK) >> PAGE_SHIFT;
const unsigned long offset =
frame->uaddr & ~PAGE_MASK;
int nr_pages = last - first + 1;
int err;
int n;
struct page **pages;
struct sg_table *sgt;
if (frame->uaddr + bytes < frame->uaddr) {
pr_err("%s: attempted user buffer overflow!\n", __func__);
return -EINVAL;
}
if (bytes == 0) {
pr_err("%s: user buffer has zero bytes\n", __func__);
return -EINVAL;
}
pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL | __GFP_ZERO);
if (!pages) {
pr_err("%s: can't allocate udma_frame.pages\n", __func__);
return -ENOMEM;
}
sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
if (!sgt) {
pr_err("%s: can't allocate udma_frame.sgt\n", __func__);
err = -ENOMEM;
goto err_alloc_sgt;
}
/* (rw == READ) means read from device, write into memory area */
err = get_user_pages_fast(frame->uaddr, nr_pages, READ == READ, pages);
if (err < nr_pages) {
nr_pages = err;
if (err > 0) {
pr_err("%s: can't pin all %d user pages, got %d\n",
__func__, nr_pages, err);
err = -EFAULT;
} else {
pr_err("%s: can't pin user pages\n", __func__);
}
goto err_get_pages;
}
for (n = 0; n < nr_pages; ++n)
flush_dcache_page(pages[n]); //<--- Is this needed?
err = sg_alloc_table_from_pages(sgt, pages, nr_pages, offset, bytes,
GFP_KERNEL);
if (err) {
pr_err("%s: can't build sg_table for %d pages\n",
__func__, nr_pages);
goto err_alloc_sgt2;
}
if (!dma_map_sg(&cam->pci_dev->dev, sgt->sgl, sgt->nents, DMA_FROM_DEVICE)) {
pr_err("%s: can't map %u sg_table entries for DMA\n",
__func__, sgt->nents);
err = -ENOMEM;
goto err_dma_map;
}
frame->pages = pages;
frame->nr_pages = nr_pages;
frame->sgt = sgt;
return 0;
err_dma_map:
sg_free_table(sgt);
err_alloc_sgt2:
err_get_pages:
for (n = 0; n < nr_pages; ++n)
put_page(pages[n]);
kfree(sgt);
err_alloc_sgt:
kfree(pages);
return err;
}
static void unpin_user_frame(struct my_dev *cam, struct udma_frame *frame)
{
int n;
dma_unmap_sg(&cam->pci_dev->dev, frame->sgt->sgl, frame->sgt->nents,
DMA_FROM_DEVICE);
sg_free_table(frame->sgt);
kfree(frame->sgt);
frame->sgt = NULL;
for (n = 0; n < frame->nr_pages; ++n) {
struct page *page = frame->pages[n];
set_page_dirty_lock(page);
mark_page_accessed(page); //<--- Without this the Oops are more frequent
put_page(page);
}
kfree(frame->pages);
frame->pages = NULL;
frame->nr_pages = 0;
}
static void unpin_user_buffer(struct my_dev *cam)
{
if (cam->udma_frames) {
int n;
for (n = 0; n < cam->udma_frame_count; ++n)
unpin_user_frame(cam, &cam->udma_frames[n]);
kfree(cam->udma_frames);
cam->udma_frames = NULL;
}
cam->udma_frame_count = 0;
cam->udma_buffer_bytes = 0;
cam->udma_buffer = NULL;
cam->udma_desc_count = 0;
}
static int pin_user_buffer(struct my_dev *cam)
{
int err;
int n;
const u32 acq_frame_count = cam->acq_buffer_bytes / cam->acq_frame_bytes;
struct udma_frame *udma_frames;
u32 udma_desc_count = 0;
if (!cam->acq_buffer) {
pr_err("%s: user buffer is NULL!\n", __func__);
return -EFAULT;
}
if (cam->udma_buffer == cam->acq_buffer
&& cam->udma_buffer_bytes == cam->acq_buffer_bytes
&& cam->udma_frame_count == acq_frame_count)
return 0;
if (cam->udma_buffer)
unpin_user_buffer(cam);
udma_frames = kcalloc(acq_frame_count, sizeof(*udma_frames),
GFP_KERNEL | __GFP_ZERO);
if (!udma_frames) {
pr_err("%s: can't allocate udma_frame array for %u frames\n",
__func__, acq_frame_count);
return -ENOMEM;
}
for (n = 0; n < acq_frame_count; ++n) {
struct udma_frame *frame = &udma_frames[n];
frame->uaddr =
(unsigned long)(cam->acq_buffer + n * cam->acq_frame_bytes);
err = pin_user_frame(cam, frame);
if (err) {
pr_err("%s: can't pin frame %d (out of %u)\n",
__func__, n + 1, acq_frame_count);
for (--n; n >= 0; --n)
unpin_user_frame(cam, frame);
kfree(udma_frames);
return err;
}
udma_desc_count += frame->sgt->nents; /* Cannot overflow */
}
pr_debug("%s: total udma_desc_count=%u\n", __func__, udma_desc_count);
cam->udma_buffer = cam->acq_buffer;
cam->udma_buffer_bytes = cam->acq_buffer_bytes;
cam->udma_frame_count = acq_frame_count;
cam->udma_frames = udma_frames;
cam->udma_desc_count = udma_desc_count;
return 0;
}
Related structures:
struct udma_frame {
unsigned long uaddr; /* User address of the frame */
int nr_pages; /* Nr. of pages covering the frame */
struct page **pages; /* Actual pages covering the frame */
struct sg_table *sgt; /* S/G table describing the frame */
};
struct my_dev {
...
u8 __user *acq_buffer; /* User-space buffer received via IOCTL */
...
u8 __user *udma_buffer; /* User-space buffer for image */
u32 udma_buffer_bytes; /* Total image size in bytes */
u32 udma_frame_count; /* Nr. of items in udma_frames */
struct udma_frame
*udma_frames; /* DMA descriptors per frame */
u32 udma_desc_count; /* Total nr. of DMA descriptors */
...
};
Questions:
How to properly pin user buffer pages and mark them as not movable?
If one frame ends and next frame starts in the same page, is it correct to handle it as two independent pages, i.e. pin the page twice?
The data flows from the device to the user buffer, and the app is not supposed to write to its buffer, but I have no control over that. Can I use DMA_FROM_DEVICE, or should I rather use DMA_BIDIRECTIONAL just in case?
Do I need to use something like SetPageReserved/ClearPageReserved or mark_page_reserved/free_reserved_page?
Is IOMMU/swiotlb somehow involved? E.g. the i915 driver doesn't use sg_alloc_table_from_pages if swiotlb is active.
What is the difference between the set_page_dirty, set_page_dirty_lock and SetPageDirty functions?
Thanks for any hint.
PS: I cannot change the way the application gets the data without breaking our library API maintained for many years. So please do not advise e.g. to mmap kernel buffer...

Why do you pass "READ == READ" as the third parameter? You need to pass flags there.
err = get_user_pages_fast(frame->uaddr, nr_pages, READ == READ, pages);
You need to pass "FOLL_LONGTERM" here. Note that FOLL_PIN is set internally by the pin_user_pages*() family (for example pin_user_pages_fast), not by get_user_pages_fast, so that family is the one to use for DMA. See https://www.kernel.org/doc/html/latest/core-api/pin_user_pages.html#case-2-rdma
In addition, you need to take care of CPU and device memory coherence: call "dma_sync_sg_for_device(...)" before the DMA transfer and "dma_sync_sg_for_cpu(...)" after it.
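On the question of one frame ending and the next one starting in the same page, the page arithmetic can be sketched in plain userspace C. This is only an illustration under stated assumptions: 4 KiB pages, with PAGE_SHIFT_ASSUMED standing in for the kernel's PAGE_SHIFT, and frame_page_span being a hypothetical helper rather than a kernel API. It shows that each of two adjacent, unaligned frames covers the shared boundary page, so pinning per frame references that page once per frame.

```c
#include <assert.h>

/* Assumption: 4 KiB pages. In a driver this would be the kernel's PAGE_SHIFT. */
#define PAGE_SHIFT_ASSUMED 12UL
#define PAGE_SIZE_ASSUMED  (1UL << PAGE_SHIFT_ASSUMED)

/* Hypothetical helper type: which pages one frame touches. */
struct page_span {
    unsigned long first_page; /* index of the first page covering the frame */
    unsigned long nr_pages;   /* number of pages covering the frame */
    unsigned long offset;     /* offset of the frame start within its first page */
};

/* Compute the page span of a frame of len bytes starting at user address uaddr. */
static struct page_span frame_page_span(unsigned long uaddr, unsigned long len)
{
    struct page_span s;
    unsigned long last_page;

    assert(len > 0);
    last_page = (uaddr + len - 1) >> PAGE_SHIFT_ASSUMED;
    s.first_page = uaddr >> PAGE_SHIFT_ASSUMED;
    s.nr_pages = last_page - s.first_page + 1;
    s.offset = uaddr & (PAGE_SIZE_ASSUMED - 1);
    return s;
}
```

For example, with 6000-byte frames starting at a page-aligned address, frame 0 covers two pages and frame 1 starts 1904 bytes into the second of them, so the last page of frame 0 is also the first page of frame 1.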

FFmpeg: unspecified pixel format when opening video with custom context

I am trying to decode a video with a custom context. The purpose is that I want to decode the video directly from memory. In the following code, I am reading from file in the read function passed to avio_alloc_context - but this is just for testing purposes.
I think I've read every post there is on Stack Overflow or any other website related to this topic; at least I definitely tried my best to do so. While there is much in common, the details differ: people set different flags, some say av_probe_input_format is required, some say it isn't, etc. And for some reason nothing works for me.
My problem is that the pixel format is unspecified (see output below), which is why I run into problems later when calling sws_getContext. I checked pFormatContext->streams[videoStreamIndex]->codec->pix_fmt, and it is -1.
Please note my comments // things I tried and // seems not to help in the code. I think the answer might be hidden somewhere there. I have tried many combinations of the hints I've read so far, but I guess I am missing a detail.
The problem is not the video file, because when I go the standard way and just call avformat_open_input(&pFormatContext, pFilePath, NULL, NULL) without a custom context, everything runs fine.
The code compiles and runs as is.
#include <libavformat/avformat.h>
#include <string.h>
#include <stdio.h>
FILE *f;
static int read(void *opaque, uint8_t *buf, int buf_size) {
if (feof(f)) return -1;
return fread(buf, 1, buf_size, f);
}
int openVideo(const char *pFilePath) {
const int bufferSize = 32768;
int ret;
av_register_all();
f = fopen(pFilePath, "rb");
uint8_t *pBuffer = (uint8_t *) av_malloc(bufferSize + AVPROBE_PADDING_SIZE);
AVIOContext *pAVIOContext = avio_alloc_context(pBuffer, bufferSize, 0, NULL,
&read, NULL, NULL);
if (!f || !pBuffer || !pAVIOContext) {
printf("error: open / alloc failed\n");
// cleanup...
return 1;
}
AVFormatContext *pFormatContext = avformat_alloc_context();
pFormatContext->pb = pAVIOContext;
const int readBytes = read(NULL, pBuffer, bufferSize);
printf("readBytes = %i\n", readBytes);
if (readBytes <= 0) {
printf("error: read failed\n");
// cleanup...
return 2;
}
if (fseek(f, 0, SEEK_SET) != 0) {
printf("error: fseek failed\n");
// cleanup...
return 3;
}
// required for av_probe_input_format
memset(pBuffer + readBytes, 0, AVPROBE_PADDING_SIZE);
AVProbeData probeData;
probeData.buf = pBuffer;
probeData.buf_size = readBytes;
probeData.filename = "";
probeData.mime_type = NULL;
pFormatContext->iformat = av_probe_input_format(&probeData, 1);
// things I tried:
//pFormatContext->flags = AVFMT_FLAG_CUSTOM_IO;
//pFormatContext->iformat->flags |= AVFMT_NOFILE;
//pFormatContext->iformat->read_header = NULL;
// seems not to help (therefore commented out here):
AVDictionary *pDictionary = NULL;
//av_dict_set(&pDictionary, "analyzeduration", "8000000", 0);
//av_dict_set(&pDictionary, "probesize", "8000000", 0);
if ((ret = avformat_open_input(&pFormatContext, "", NULL, &pDictionary)) < 0) {
char buffer[4096];
av_strerror(ret, buffer, sizeof(buffer));
printf("error: avformat_open_input failed: %s\n", buffer);
// cleanup...
return 4;
}
printf("retrieving stream information...\n");
if ((ret = avformat_find_stream_info(pFormatContext, NULL)) < 0) {
char buffer[4096];
av_strerror(ret, buffer, sizeof(buffer));
printf("error: avformat_find_stream_info failed: %s\n", buffer);
// cleanup...
return 5;
}
printf("nb_streams = %i\n", pFormatContext->nb_streams);
// further code...
// cleanup...
return 0;
}
int main() {
openVideo("video.mp4");
return 0;
}
This is the output that I get:
readBytes = 32768
retrieving stream information...
[mov,mp4,m4a,3gp,3g2,mj2 # 0xdf8d20] stream 0, offset 0x30: partial file
[mov,mp4,m4a,3gp,3g2,mj2 # 0xdf8d20] Could not find codec parameters for stream 0 (Video: h264 (avc1 / 0x31637661), none, 640x360, 351 kb/s): unspecified pixel format
Consider increasing the value for the 'analyzeduration' and 'probesize' options
nb_streams = 2
UPDATE:
Thanks to WLGfx, here is the solution: the only thing that was missing was the seek function. Apparently, implementing it is mandatory for decoding. It is important to return the new offset, and not 0, on success (some solutions found on the web just return the return value of fseek, and that is wrong). Here is the minimal solution that made it work:
static int64_t seek(void *opaque, int64_t offset, int whence) {
if (whence == SEEK_SET && fseek(f, offset, SEEK_SET) == 0) {
return offset;
}
// handling AVSEEK_SIZE doesn't seem mandatory
return -1;
}
Of course, the call to avio_alloc_context needs to be adapted accordingly:
AVIOContext *pAVIOContext = avio_alloc_context(pBuffer, bufferSize, 0, NULL,
&read, NULL, &seek);
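A somewhat fuller seek callback might look like the sketch below. It is not taken from the original post: it assumes the FILE pointer is passed through the opaque argument instead of a global, the name file_seek is made up, and AVSEEK_SIZE and AVSEEK_FORCE are defined locally (with the values from libavformat/avio.h) only so that the sketch compiles without the FFmpeg headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Values as in libavformat/avio.h; defined here only so this sketch
 * builds standalone. With FFmpeg headers included, these are skipped. */
#ifndef AVSEEK_SIZE
#define AVSEEK_SIZE 0x10000
#endif
#ifndef AVSEEK_FORCE
#define AVSEEK_FORCE 0x20000
#endif

/* Seek callback for avio_alloc_context(); opaque is assumed to be a FILE *.
 * Returns the new absolute position, the file size for AVSEEK_SIZE,
 * or -1 on failure. Limited to offsets that fit in a long. */
static int64_t file_seek(void *opaque, int64_t offset, int whence)
{
    FILE *fp = opaque;

    assert(fp != NULL);
    whence &= ~AVSEEK_FORCE; /* the force flag may be ORed into whence */

    if (whence == AVSEEK_SIZE) {
        /* Report the total size without disturbing the current position. */
        long cur = ftell(fp);
        long size;

        if (cur < 0 || fseek(fp, 0, SEEK_END) != 0)
            return -1;
        size = ftell(fp);
        if (fseek(fp, cur, SEEK_SET) != 0)
            return -1;
        return size;
    }
    if (whence != SEEK_SET && whence != SEEK_CUR && whence != SEEK_END)
        return -1;
    if (fseek(fp, (long)offset, whence) != 0)
        return -1;
    return ftell(fp); /* the new offset, not fseek()'s 0 */
}
```

With this variant, the FILE pointer would be passed as the fourth argument of avio_alloc_context so that it arrives in opaque.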

Seeing as yours is a file-based stream, it is seekable, so you can provide the AVIO seek callback when creating the AVIOContext:
avioContext = avio_alloc_context((uint8_t *)avio_buffer, AVIO_QUEUE_SIZE * PKT_SIZE7,
0,
this, // *** This is your data pointer to a class or other data passed to the callbacks
avio_ReadFunc,
NULL,
avio_SeekFunc);
Handle the seeking with this callback: (You can cast ptr to your class or other data structure)
int64_t FFIOBufferManager::avio_SeekFunc(void *ptr, int64_t pos64, int whence) {
// SEEK_SET(0), SEEK_CUR(1), SEEK_END(2), AVSEEK_SIZE
// ptr is cast to your data or class
switch (whence) {
case 0 : // SEEK_SET
... etc
case (AVSEEK_SIZE) : // get size
return -1; // if you're unable to get the size
break;
}
// set new position in the file
return (int64_t)new_pos; // new position
}
You can also set the input format and the probesize when attaching the AVIOContext to the AVFormatContext. This allows ffmpeg to seek in the stream to better determine the format.
context->pb = ffio->avioContext;
context->flags = AVFMT_FLAG_CUSTOM_IO;
context->iformat = av_find_input_format("mpegts"); // not necessary
context->probesize = 1200000;
So far I haven't had the need for av_probe_input_format, but then again my streams are mpegts.
Hope this helps.
EDIT: Added a comment to the avio_alloc_context function to mention how the ptr is used in the callbacks.

Although the seek was the right answer in your situation, in my case it's not possible: I have to stream the data, and a stream cannot seek.
So I had to look into: why is a seek required?
From what the ffmpeg docs say, it caches some data so that it can seek back if required by the current encoder/decoder. But that buffer is relatively small (you probably don't want to cache hundreds of MB of data).
MP4 files often store some metadata at the end of the file (it is written once it's known). When reading such a file, the demuxer wants to seek to a position very far into the file (near the end) and read what is called the moov atom. Without that info, it cannot decode your data.
What I had to do to fix this issue is move that moov atom with the following command:
ffmpeg -i far.mp4 -c copy -map 0 -movflags +faststart close.mp4
faststart means you do not have to stream the entire file to start playing (decoding) the file.
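To check whether a given file already has the moov atom in front, the top-level boxes can be walked directly. The following is a minimal sketch, not part of ffmpeg: the function name moov_before_mdat is made up, and it relies only on the basic box layout (a 4-byte big-endian size followed by a 4-byte type, with size 1 meaning a 64-bit size follows and size 0 meaning the box runs to the end of the file).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Walk the top-level boxes of an MP4 file.
 * Returns 1 if 'moov' appears before 'mdat' (faststart layout),
 * 0 if 'mdat' comes first, -1 if neither is found or parsing fails.
 * Uses fseek(), so box sizes are limited to what fits in a long. */
static int moov_before_mdat(FILE *fp)
{
    unsigned char hdr[8];

    assert(fp != NULL);
    if (fseek(fp, 0, SEEK_SET) != 0)
        return -1;

    while (fread(hdr, 1, sizeof hdr, fp) == sizeof hdr) {
        uint64_t size = ((uint64_t)hdr[0] << 24) | ((uint64_t)hdr[1] << 16) |
                        ((uint64_t)hdr[2] << 8) | (uint64_t)hdr[3];
        uint64_t header_len = 8;

        if (memcmp(hdr + 4, "moov", 4) == 0)
            return 1;
        if (memcmp(hdr + 4, "mdat", 4) == 0)
            return 0;

        if (size == 1) {
            /* 64-bit "largesize" follows the type field. */
            unsigned char big[8];
            int i;

            if (fread(big, 1, sizeof big, fp) != sizeof big)
                return -1;
            size = 0;
            for (i = 0; i < 8; i++)
                size = (size << 8) | big[i];
            header_len = 16;
        } else if (size == 0) {
            break; /* box extends to the end of the file */
        }
        if (size < header_len)
            return -1; /* corrupt size field */
        if (fseek(fp, (long)(size - header_len), SEEK_CUR) != 0)
            return -1;
    }
    return -1;
}
```

A result of 0 would suggest the file needs the faststart remux shown above before it can be decoded from a non-seekable stream.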

Is it possible to map an existing buffer to a new file?

The idea is relatively simple, but I see some complications for implementations, so I'm wondering if it's even possible right now.
An example of what I'd like to do is to generate some data in a
buffer, then map the contents of this buffer to a file. Instead of
having the memory space virtually populated with the contents of the
file, I'd like the contents of the original buffer to be transferred
to the system cache (which should be a zero-copy operation) and
dirtied immediately (which would flush the data out to disk
eventually).
Of course the complication I mentioned is that the buffer should be deallocated and unmapped (since the data is now under the responsibility of the system cache), and I don't know how to do that either.
The important aspects are that:
1. The program can control when the file is created and linked.
2. The program isn't required to anticipate the size of the file, nor does it have to remap it as the dataset grows. Instead it can realloc the initial buffer (using an efficient memory allocator for this) until it is satisfied (it knows for sure that the dataset won't grow anymore) before finally mapping it to the file.
3. The data remains accessible through the same virtual memory address even after being mapped to the file, still without a single intra-memory copy.
One assumption is that:
We can use an arbitrary memory allocator (or memory management scheme in general) that can manage dynamic buffers more efficiently than mmap/mremap can for the memory space it manages, because the latter must deal with the filesystem to grow/shrink the file, which would always be slower.
So, (1) are these requirements too constrained? (2) Is this assumption correct?
PS: I had to arbitrarily pick the tags for this question, but I'm also interested in hearing how BSDs and Windows would do this. Of course if the POSIX API allows to do this already, that would be great.
Update: I call a buffer a space of private memory (private to the process/task in any OS with normal VMM) allocated in primary memory. The high-level goal involves generating a dataset of an arbitrary size using another input (in my case the network), then once it's generated, make it accessible for long periods of time (to the network and to the process itself), saving it to disk in the process.
If I keep the datasets in private memory and write them out normally, they'll just be swapped when the OS needs the space, which is a bit stupid since they're already on disk.
If I map another region then I have to copy the contents of the buffer to that region (which resides in the system cache), which, again, is a tad stupid since I won't use that buffer after that.
The alternative that I see is to write or use a full-blown userland cache reading and writing to the disk itself to ensure that (a) pages don't get uselessly swapped out and (b) the process doesn't hold too much memory for itself, which is never possible to do optimally anyway (better let the kernel do its job), and which is simply not a road I think is worth going down (too complex for less gains).
Update: Requirements 2 and 3 are non-issues considering Nominal Animal's answer. Of course this implies that the assumption is incorrect, as he proved is almost the case (overhead is minimal). I also relaxed requirement 1, O_TMPFILE is indeed perfect for this.
Update: A recent article on LWN mentions, somewhere in the middle: "That could possibly be done with a special write operation that would not actually cause I/O, or with a system call that would transfer a physical page into the page cache". That suggests that indeed, there is currently (April 2014) no way to do this at least with Linux (and likely other operating systems), much less with a standard API. The article is about PostgreSQL, but the issue in question is identical, except perhaps for the specific requirements to this question, which aren't defined in the article.

This is not a satisfactory answer to the question; it is more of a continuation of the comment chain.
Here is a test program one can use to measure the overhead of using a file-backed memory map, instead of an anonymous memory map.
Note that the work() function listed just fills in the memory map with random data. To be more realistic, it should simulate at least the access patterns expected from real-world usage.
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <time.h>
#include <stdint.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
/* Xorshift random number generator.
*/
static uint32_t xorshift_state[4] = {
123456789U,
362436069U,
521288629U,
88675123U
};
static int xorshift_setseed(const void *const data, const size_t len)
{
uint32_t state[4] = { 0 };
if (len < 1)
return ENOENT;
else
if (len < sizeof state)
memcpy(state, data, len);
else
memcpy(state, data, sizeof state);
if (state[0] || state[1] || state[2] || state[3]) {
xorshift_state[0] = state[0];
xorshift_state[1] = state[1];
xorshift_state[2] = state[2];
xorshift_state[3] = state[3];
return 0;
}
return EINVAL;
}
static uint32_t xorshift_u32(void)
{
const uint32_t temp = xorshift_state[0] ^ (xorshift_state[0] << 11U);
xorshift_state[0] = xorshift_state[1];
xorshift_state[1] = xorshift_state[2];
xorshift_state[2] = xorshift_state[3];
return xorshift_state[3] ^= (temp >> 8U) ^ temp ^ (xorshift_state[3] >> 19U);
}
/* Wallclock timing functions.
*/
static struct timespec wallclock_started;
static void wallclock_start(void)
{
clock_gettime(CLOCK_REALTIME, &wallclock_started);
}
static double wallclock_stop(void)
{
struct timespec wallclock_stopped;
clock_gettime(CLOCK_REALTIME, &wallclock_stopped);
return difftime(wallclock_stopped.tv_sec, wallclock_started.tv_sec)
+ (double)(wallclock_stopped.tv_nsec - wallclock_started.tv_nsec) / 1000000000.0;
}
/* Accessor function. This needs to read/modify/write the mapping,
* simulating the actual work done onto the mapping.
*/
static void work(void *const area, size_t const length)
{
uint32_t *const data = (uint32_t *)area;
size_t size = length / sizeof data[0];
size_t i;
/* Add xorshift data. */
for (i = 0; i < size; i++)
data[i] += xorshift_u32();
}
int main(int argc, char *argv[])
{
long page, size, delta, maxsize, steps;
int fd, result;
void *map, *old;
char dummy;
double seconds;
page = sysconf(_SC_PAGESIZE);
if (argc < 5 || argc > 6 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s MAPFILE SIZE DELTA MAXSIZE [ SEEDSTRING ]\n", argv[0]);
fprintf(stderr, "Where:\n");
fprintf(stderr, " MAPFILE backing file, '-' for none\n");
fprintf(stderr, " SIZE initial map size\n");
fprintf(stderr, " DELTA map size change\n");
fprintf(stderr, " MAXSIZE final size of the map\n");
fprintf(stderr, " SEEDSTRING seeds the Xorshift PRNG\n");
fprintf(stderr, "Note: sizes must be page aligned, each page being %ld bytes.\n", (long)page);
fprintf(stderr, "\n");
return 1;
}
if (argc >= 6) {
if (xorshift_setseed(argv[5], strlen(argv[5]))) {
fprintf(stderr, "%s: Invalid seed string for the Xorshift generator.\n", argv[5]);
return 1;
} else {
fprintf(stderr, "Xorshift initialized with { %lu, %lu, %lu, %lu }.\n",
(unsigned long)xorshift_state[0],
(unsigned long)xorshift_state[1],
(unsigned long)xorshift_state[2],
(unsigned long)xorshift_state[3]);
fflush(stderr);
}
}
if (sscanf(argv[2], " %ld %c", &size, &dummy) != 1) {
fprintf(stderr, "%s: Invalid map size.\n", argv[2]);
return 1;
} else
if (size < page || size % page) {
fprintf(stderr, "%s: Map size must be a multiple of page size (%ld).\n", argv[2], page);
return 1;
}
if (sscanf(argv[3], " %ld %c", &delta, &dummy) != 1) {
fprintf(stderr, "%s: Invalid map size change.\n", argv[3]);
return 1;
} else
if (delta % page) {
fprintf(stderr, "%s: Map size change must be a multiple of page size (%ld).\n", argv[3], page);
return 1;
}
if (delta) {
if (sscanf(argv[4], " %ld %c", &maxsize, &dummy) != 1) {
fprintf(stderr, "%s: Invalid final map size.\n", argv[4]);
return 1;
} else
if (maxsize < page || maxsize % page) {
fprintf(stderr, "%s: Final map size must be a multiple of page size (%ld).\n", argv[4], page);
return 1;
}
steps = (maxsize - size) / delta;
if (steps < 0L)
steps = -steps;
} else {
maxsize = size;
steps = 0L;
}
/* Time measurement includes the file open etc. overheads.
*/
wallclock_start();
if (strlen(argv[1]) < 1 || !strcmp(argv[1], "-"))
fd = -1;
else {
do {
fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0600);
} while (fd == -1 && errno == EINTR);
if (fd == -1) {
fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
return 1;
}
do {
result = ftruncate(fd, (off_t)size);
} while (result == -1 && errno == EINTR);
if (result == -1) {
fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
return 1;
}
result = posix_fadvise(fd, 0, size, POSIX_FADV_RANDOM);
}
/* Initial mapping. */
if (fd == -1)
map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
else
map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "Memory map failed: %s.\n", strerror(errno));
if (fd != -1) {
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
}
return 1;
}
result = posix_madvise(map, size, POSIX_MADV_RANDOM);
work(map, size);
while (steps-->0L) {
if (fd != -1) {
do {
result = ftruncate(fd, (off_t)(size + delta));
} while (result == -1 && errno == EINTR);
if (result == -1) {
fprintf(stderr, "%s: Cannot grow file: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
return 1;
}
result = posix_fadvise(fd, 0, size, POSIX_FADV_RANDOM);
}
old = map;
map = mremap(map, size, size + delta, MREMAP_MAYMOVE);
if (map == MAP_FAILED) {
fprintf(stderr, "Cannot remap memory map: %s.\n", strerror(errno));
munmap(old, size);
if (fd != -1) {
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
}
return 1;
}
size += delta;
result = posix_madvise(map, size, POSIX_MADV_RANDOM);
work(map, size);
}
/* Timing does not include file renaming.
*/
seconds = wallclock_stop();
munmap(map, size);
if (fd != -1) {
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
}
printf("%.9f seconds elapsed.\n", seconds);
return 0;
}
If you save the above as bench.c, you can compile it using
gcc -W -Wall -O3 bench.c -lrt -o bench
Run it without parameters to see the usage.
On my machine, on ext4 filesystem, running tests
./bench - 4096 4096 4096000
./bench testfile 4096 4096 4096000
yields 1.307 seconds wall clock time for the anonymous memory map, and 1.343 seconds for the file-backed memory map, meaning the file backed mapping is about 2.75% slower.
This test starts with a one-page memory map, then enlarges it by one page about a thousand times. For tests like 4096000 4096 8192000 the difference is even smaller. The time measured does include constructing the initial file (the listing above sizes it with ftruncate()).
Running the test on tmpfs, on ext4 over swRAID0, and on ext4 over swRAID1, on the same machine, does not seem to affect the results; all differences are lost in the noise.
While I would prefer to test this on multiple machines and kernel versions before making any sweeping statements, I do know something about how the kernel manages these memory maps. Therefore, I shall make the following claim, based on above and my own experience:
Using a file-backed memory map will not cause a significant slowdown compared to an anonymous memory map, or even compared to malloc()/realloc()/free(). I expect the difference to be under 5% in all real-world use cases, and at most 1% for typical real-world use cases; less, if the resizes are rare compared to how often the map is accessed.
To user2266481 the above means it should be acceptable to just create a temporary file on the target filesystem, to hold the memory map. (Note that it is possible to create the temporary file without allowing anyone access to it, mode 0, as access mode is only checked when opening the file.) When the contents are in final form, ftruncate() and msync() the contents, then hard-link the final file to the temporary file using link(). Finally, unlink the temporary file and close the temporary file descriptor, and the task should be completed with near-optimal efficiency.
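The workflow described above could be sketched roughly as follows. This is an illustration under assumptions, not the exact code I would ship: publish_via_mapping, fill_fn and fill_demo are made-up names, the temporary file is created with mkstemp() (mode 0600 rather than mode 0), the size is fixed up front instead of being grown with ftruncate()/mremap(), and the map is unmapped at the end although a real program could keep using it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback that generates the dataset directly in the map. */
typedef void (*fill_fn)(void *map, size_t len);

/* Demo generator: fills the map with the letter 'A'. */
static void fill_demo(void *map, size_t len)
{
    memset(map, 'A', len);
}

/* Create a temporary file from the mkstemp() template tmpl (modified in
 * place), generate len bytes in a shared mapping of it, flush the data,
 * then hard-link it as final_path and drop the temporary name.
 * tmpl and final_path must be on the same filesystem for link() to work.
 * Returns 0 on success, -1 on failure with errno set. */
static int publish_via_mapping(char *tmpl, const char *final_path,
                               size_t len, fill_fn fill)
{
    void *map = MAP_FAILED;
    int saved, fd;

    assert(len > 0 && fill != NULL);

    fd = mkstemp(tmpl);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, (off_t)len) == -1)
        goto fail;
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto fail;

    fill(map, len);                       /* dataset is built in place */

    if (msync(map, len, MS_SYNC) == -1)   /* push dirty pages to the file */
        goto fail;
    if (link(tmpl, final_path) == -1)     /* publish under the final name */
        goto fail;

    munmap(map, len);
    unlink(tmpl);                         /* only the final name remains */
    close(fd);
    return 0;

fail:
    saved = errno;
    if (map != MAP_FAILED)
        munmap(map, len);
    unlink(tmpl);
    close(fd);
    errno = saved;
    return -1;
}
```

Because the data is generated inside the shared mapping, no copy from a private buffer is needed; the pages are already in the page cache when the file gets its final name.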
