simple_copy example on pmem.io - c

I have successfully created the emulated device described at http://pmem.io/2016/02/22/pm-emulation.html.
It shows the device correctly:
:~/Prakash/nvml/src/examples/libpmem$ mount | grep pmem
/dev/pmem0 on /mnt/pmemd type ext4 (rw,relatime,dax,errors=continue,data=ordered)
However, when I execute the simple_copy sample given with pmem nvml, it gives this error:
amd@amd:~/Prakash/nvml/src/examples/libpmem$ ./simple_copy logs /dev/pmem0
pmem_map_file: File exists
amd@amd:~/Prakash/nvml/src/examples/libpmem$ ./simple_copy logs /dev/pmem0/logs
pmem_map_file: Not a directory
Am I not using the program correctly?
Also, I have mounted the device as dax and I clearly see the performance advantage with
:~/Prakash/nvml/src/examples/libpmem$ sudo dd if=/dev/zero of=/dev/pmem0 bs=2G count=1
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB, 2.0 GiB) copied, 0.910729 s, 2.4 GB/s
:~/Prakash/nvml/src/examples/libpmem$ sudo dd if=/dev/zero of=/mnt/pmem0/test bs=2G count=1
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB, 2.0 GiB) copied, 6.39032 s, 336 MB/s

From the errors posted, it seems reasonable to believe:
without the appropriate option, it will not create a directory
without the appropriate option, it will not replace a file

If you open the example you are referring to, you will see the following:
if ((pmemaddr = pmem_map_file(argv[2], BUF_LEN,
                PMEM_FILE_CREATE|PMEM_FILE_EXCL,
                0666, &mapped_len, &is_pmem)) == NULL) {
        perror("pmem_map_file");
        exit(1);
}
This is the part that is giving you trouble. To understand why, let's look at man 7 libpmem. You can find the relevant part here.
This is the paragraph we are interested in:
The pmem_map_file() function creates a new read/write mapping for a
file. If PMEM_FILE_CREATE is not specified in flags, the entire
existing file path is mapped, len must be zero, and mode is ignored.
Otherwise, path is opened or created as specified by flags and mode,
and len must be non-zero. pmem_map_file() maps the file using mmap(2),
but it also takes extra steps to make large page mappings more likely.
So, the pmem_map_file function effectively calls open(2) and then mmap(2). In the simple_copy.c example we can observe that the flags which were used are: PMEM_FILE_CREATE and PMEM_FILE_EXCL, and as we can learn from the manpage, they roughly translate to O_CREAT and O_EXCL respectively.
This means that the error messages are correct: on your first attempt you provided an existing file (PMEM_FILE_EXCL, like O_EXCL, makes that fail with "File exists"), while on the second attempt the path /dev/pmem0/logs treats the block device /dev/pmem0 as a directory, which it is not, hence "Not a directory".
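As a minimal sketch of a fix, assuming you simply want the destination to live on the DAX-mounted filesystem (the path /mnt/pmemd/logs is illustrative, matching the mount shown above), point argv[2] at a file under the mount point; and if you want an existing destination file to be reused rather than rejected, drop PMEM_FILE_EXCL:

if ((pmemaddr = pmem_map_file("/mnt/pmemd/logs", BUF_LEN,
                PMEM_FILE_CREATE, /* no PMEM_FILE_EXCL: open the file if it already exists */
                0666, &mapped_len, &is_pmem)) == NULL) {
        perror("pmem_map_file");
        exit(1);
}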
There's an in-depth explanation of libpmem here.

Related

What happens if you try to read/write a mapping with a deleted / disconnected backing file or device?

If I perform a mmap() on some file or a device in /dev/ that exposes memory, what happens to that mapping if the file is deleted or that device disconnected?
I've written a test program to experiment but I can't find any solid documentation on what should happen. The test program does the following
#include <stdio.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main()
{
    int fd = open("test_file", O_CREAT | O_RDWR, 0644); /* O_CREAT requires a mode argument */
    uint32_t data[10000] = {0};
    data[3] = 22;
    write(fd, data, sizeof data);
    sync();
    struct stat s;
    fstat(fd, &s);
    uint32_t *map = (uint32_t *)mmap(NULL, s.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    sleep(10);
    printf("%u\n", *(map + 3));
    *(map + 9000) = 91;
    printf("%u\n", *(map + 9000));
    return 0;
}
My page size is 4096 bytes so I made the file larger than that so the map would span multiple pages. During that 10s sleep I run the following commands:
$ rm test_file
$ sync; echo 3 > /proc/sys/vm/drop_caches
This, I believe, should destroy the backing file of the map and drop all cached pages, so that any further operation on the mapping has to go back out to the backing file. Then, after the 10s sleep, the program attempts to read from the first page, and to write and then read a page further into the mapping.
Surprisingly the 2 values I get printed back out are 22 and 91
Why does this happen, is this behavior guaranteed or undefined? Is it because I was using a regular file and not a device? What if the mapping was really large, would things change? Should I expect to get a SIGSEGV or a SIGBUS under some conditions?
rm just unlinks the file from a location in the filesystem (removes it from a directory). If there are other references to the file (such as a process that has it open), the file won't actually be removed -- the OS keeps a reference count of all the references to the file and only deletes it when that count drops to 0.
So in this case, the reference count will still be non-zero after the rm, as the process has the file open. Only when the file is unmapped and closed (which happens when the process exits) will it actually be deleted.
In the case of a device, the device file (in the filesystem) is similarly just a reference to the device driver. Removing it won't have any effect. However, if the device itself has some concept of being removed (such as removable storage), doing that will result in future access returning some error.
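A minimal sketch demonstrating the reference-counting behaviour described above with plain file descriptors (the file name refcount_demo is just illustrative):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("refcount_demo", O_CREAT | O_RDWR | O_TRUNC, 0644);
    write(fd, "still here", 10);
    unlink("refcount_demo");   /* what rm does: the name is gone... */
    char buf[11] = {0};
    pread(fd, buf, 10, 0);     /* ...but the inode lives while fd stays open */
    printf("%s\n", buf);       /* prints "still here" */
    close(fd);                 /* reference count drops to 0; file is deleted */
    return 0;
}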
What happens if you try to read/write a mapping with a deleted ... file
You will still write to that file. rm only unlinks the name from the directory, the file still exists.
disconnected backing ... device?
The process will receive a SIGBUS signal.
Why does this happen, is this behavior guaranteed or undefined?
Guaranteed; files have been reference-counted like this in Unix since the beginning.
Is it because I was using a regular file and not a device?
No. A device file behaves much like a regular file here: you can open("/dev/sda", ...), write to it, and rm /dev/sda. In Linux almost everything is a file; a regular file just has one more layer of indirection in the kernel - a filesystem.
What if the mapping was really large, would things change?
Should I expect to get a SIGSEGV or a SIGBUS under some conditions?
See man mmap. Search for SIGBUS.

Is there a way to calculate I/O and memory of current process in C?

If I use
/usr/bin/time -f"%e,%P,%M,%I,%O"
I get (for the last three placeholders) the memory the process used, and whether there was any input and output during it.
Obviously, it's easy to get %e or something like it using sys/time.h, but is there a way to get %M, %I and %O programmatically?
You could read and parse the files in the /proc filesystem. /proc/self refers to the process accessing the /proc filesystem.
/proc/self/statm contains information about memory usage, measured in pages. Sample output:
% cat /proc/self/statm
1115 82 63 12 0 79 0
Fields are size resident share text lib data dt; see the proc manual page for some additional details.
/proc/self/io contains the I/O for the current process. Sample output:
% cat /proc/self/io
rchar: 2012
wchar: 0
syscr: 6
syscw: 0
read_bytes: 0
write_bytes: 0
cancelled_write_bytes: 0
Unfortunately, io isn't documented in the proc manual page (at least on my Debian system). I had to check the iotop source code to see how it obtained the per-process I/O information.
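A minimal sketch that reads both files for the current process, assuming only the Linux /proc layout shown above:

#include <stdio.h>

int main(void)
{
    /* /proc/self/statm: seven space-separated counts, measured in pages */
    long size, resident, share, text, lib, data, dt;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f != NULL) {
        if (fscanf(f, "%ld %ld %ld %ld %ld %ld %ld",
                   &size, &resident, &share, &text, &lib, &data, &dt) == 7)
            printf("resident set: %ld pages\n", resident);
        fclose(f);
    }

    /* /proc/self/io: "key: value" lines such as rchar, wchar, read_bytes */
    char key[32];
    long long value;
    FILE *g = fopen("/proc/self/io", "r");
    if (g != NULL) {
        while (fscanf(g, "%31s %lld", key, &value) == 2)
            printf("%s %lld\n", key, value);
        fclose(g);
    }
    return 0;
}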

Linux programming: which device a file is in

I would like to know which entry under /dev a file is in. For example, if /dev/sdc1 is mounted under /media/disk, and I ask for /media/disk/foo.txt, I would like to get /dev/sdc as response.
Using stat system call on that file I will get its partition major and minor numbers (8 and 33, for sdc1). Now I need to get the "root" device (sdc) or its major/minor from that. Is there any syscall or library function I could use to link a partition to its main device? Or even better, to get that device directly from the file?
brw-rw---- 1 root floppy 8, 32 2011-04-01 20:00 /dev/sdc
brw-rw---- 1 root floppy 8, 33 2011-04-01 20:00 /dev/sdc1
The quick and dirty version: df $file | awk 'NR == 2 {print $1}'.
Programmatically... well, there's a reason I started with the quick and dirty version. There's no portable way to programmatically get the list of mounted filesystems. (getmntent() gets fstab entries, which is not the same thing.) Moreover, you can't even parse the output of mount(8) reliably; on different Unixes, the mountpoint may be the first or the last item. The most portable way to do this ends up being... parsing df output (And even that is iffy, as you noticed with the partition number.). So you're right back to the quick and dirty shell solution anyway, unless you want to traverse /dev and look for block devices with matching major(st_rdev) (major() being from sys/types.h).
If you restrict this to Linux, you can use /proc/mounts to get the list of mounted filesystems. Other specific Unixes can similarly be optimized: for example, on OS X and I think FreeBSD, you can use sysctl() on the vfs tree to get mountpoints. At worst you can find and use the appropriate header file to decipher whatever the mount table file is (and yes, even that varies: on Solaris it's /etc/mnttab, on many other systems it's /etc/mtab, some systems put it in /var/run instead of /etc, and on many Linuxes it's either nonexistent or a symlink to /proc/mounts). And its format is different on pretty much every Unix-like OS.
The information you want exists in sysfs which exposes the linux device tree. This models the relationships between the devices on the system and since you are trying to determine a parent disk device from a partition, this is the place to look. I don't know if there are any hard and fast rules you can rely on to stop your code breaking with future versions of the kernel, but the kernel developers do try to maintain sysfs as a stable interface.
If you look at /sys/dev/block/<major>:<minor>, you'll see it is a symlink with the tail components being block/<disk-device-name>/<partition-device-name>. If you were to perform a readlink(2) system call on that, you could parse the link destination to get the disk device name. In shell (since it's easier to express this way, but doing it in C will be pretty easy):
$ echo $(basename $(dirname $(readlink /sys/dev/block/8:33)))
sdc
Alternatively, you could take advantage of the nesting of partition directories in the disk directories (again in shell, but from C it's an open(2), read(2), and close(2)):
$ cat /sys/dev/block/8:33/../dev
8:32
That assumes your starting major:minor is actually for a partition, not some other sort of non-nested device.
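Putting the readlink(2) variant together, here's a minimal C sketch under the assumptions above (the file sits on a partition of a plain disk; error handling is kept minimal):

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor() */

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    struct stat st;
    if (stat(argv[1], &st) < 0) {
        perror("stat");
        return 1;
    }

    /* st_dev identifies the partition holding the file */
    char sys_path[PATH_MAX];
    snprintf(sys_path, sizeof sys_path, "/sys/dev/block/%u:%u",
             major(st.st_dev), minor(st.st_dev));

    char target[PATH_MAX];
    ssize_t n = readlink(sys_path, target, sizeof target - 1);
    if (n < 0) {
        perror("readlink");
        return 1;
    }
    target[n] = '\0';

    /* The link target ends in .../block/<disk>/<partition>;
     * the parent directory of the partition is the whole disk. */
    char *slash = strrchr(target, '/');
    *slash = '\0';
    printf("/dev/%s\n", strrchr(target, '/') + 1);
    return 0;
}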
What you're looking for is impossible in general - there is no 1:1 connection between a block device file and the partition it describes.
Consider:
You can create multiple block device files with different names (but the same major and minor numbers) and they are indistinguishable (N:1)
You can use a block device file as an argument to mount to mount a partition and then delete the block device file leaving the partition mounted. (0:1)
So there is no way to do what you want except in a few specific and narrow cases.
The major number will tell you which kind of device it is: 3 - IDE on the 1st controller, 22 - IDE on the 2nd controller, and 8 for SCSI.
The minor number will tell you the partition number and - for IDE devices - whether it's the primary or secondary drive. The calculation is different for IDE and SCSI.
For IDE it is: x*64 + p, where x is the drive number on the controller (0 or 1) and p is the partition.
For SCSI it is: y*16 + p, where y is the drive number and p is the partition. For example, sdc1 from the question has minor 33 = 2*16 + 1: the third SCSI drive (y=2), partition 1.
Not a syscall, but:
df -h /path/to/my/file
From https://unix.stackexchange.com/questions/128471/determine-what-device-a-directory-is-located-on
So you could look at df's source code and see what it does.
I realize this post is old, but this question was the 2nd result in my search and no one has mentioned df -h

How many files can I have opened at once?

On a typical OS, how many files can I have open at once using standard C disk I/O?
I tried to read some constant that should tell it, but on 32-bit Windows XP that was a measly 20 or something. It seemed to work fine with over 30, though I haven't tested it extensively.
I need about 400 files open at once at most, so if most modern OSes support that, it would be awesome. It doesn't need to support XP, but it should support Linux, Win7, and recent versions of Windows Server.
The alternative is to write my own mini file system, which I want to avoid if possible.
On Linux, this is dependent on the amount of available file descriptors.
You can use ulimit -n to set / show the number of available FD's per shell.
See these instructions on how to check (or change) the total number of available FDs in Linux.
This IBM support article suggests that on Windows the number is 512, and you can change it in the registry (as instructed in the article)
Since open() returns the fd as an int, the size of int also caps the upper limit (irrelevant in practice, as INT_MAX is a lot).
A process can query the limit using the getrlimit system call:
#include <stdio.h>
#include <sys/resource.h>

struct rlimit rlim;
getrlimit(RLIMIT_NOFILE, &rlim);
printf("Max number of open files: %ld\n", (long)rlim.rlim_cur);
FYI, as root you first have to modify the 'nofile' item in /etc/security/limits.conf. For example:
* hard nofile 10240
* soft nofile 10240
(changes in limits.conf typically take effect when the user logs in)
Then, users can use the ulimit -n bash command. I've tested this with up to 10,240 files on Fedora 11.
ulimit -n <max_number_of_files>
Lastly, all this is capped by the kernel-wide limit, given by the following (I guess you could echo a value into this to go even higher... at your own risk):
cat /proc/sys/fs/file-max
Also, see http://www.karakas-online.de/forum/viewtopic.php?t=9834

How do I create a sparse file programmatically, in C, on Mac OS X?

I'd like to create a sparse file such that all-zero blocks don't take up actual disk space until I write data to them. Is it possible?
There seems to be some confusion as to whether the default Mac OS X filesystem (HFS+) supports holes in files. The following program demonstrates that this is not the case.
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

void create_file_with_hole(void)
{
    int fd = open("file.hole", O_WRONLY|O_TRUNC|O_CREAT, 0600);
    write(fd, "Hello", 5);
    lseek(fd, 99988, SEEK_CUR); // Make a hole
    write(fd, "Goodbye", 7);
    close(fd);
}

void create_file_without_hole(void)
{
    int fd = open("file.nohole", O_WRONLY|O_TRUNC|O_CREAT, 0600);
    write(fd, "Hello", 5);
    char buf[99988];
    memset(buf, 'a', 99988);
    write(fd, buf, 99988); // Write lots of bytes
    write(fd, "Goodbye", 7);
    close(fd);
}

int main()
{
    create_file_with_hole();
    create_file_without_hole();
    return 0;
}
The program creates two files, each 100,000 bytes in length, one of which has a hole of 99,988 bytes.
On Mac OS X 10.5 on an HFS+ partition, both files take up the same number of disk blocks (200):
$ ls -ls
total 400
200 -rw------- 1 user staff 100000 Oct 10 13:48 file.hole
200 -rw------- 1 user staff 100000 Oct 10 13:48 file.nohole
Whereas on CentOS 5, the file without holes consumes 88 more disk blocks than the other:
$ ls -ls
total 136
24 -rw------- 1 user nobody 100000 Oct 10 13:46 file.hole
112 -rw------- 1 user nobody 100000 Oct 10 13:46 file.nohole
As in other Unixes, it's a feature of the filesystem. Either the filesystem supports it for ALL files or it doesn't. Unlike Win32, you don't have to do anything special to make it happen. Also unlike Win32, there is no performance penalty for using a sparse file.
On MacOS, the default filesystem is HFS+ which does not support sparse files.
Update: MacOS used to support UFS volumes with sparse file support, but that has been removed. None of the currently supported filesystems feature sparse file support.
This thread has become a comprehensive source of info about sparse files. Here is the missing part for Win32:
Decent article with examples
Tool that estimates whether it makes sense to make a file sparse
hdiutil can handle sparse images and files but unfortunately the framework it links against is private.
You could try declaring the external symbols defined by the DiskImages framework (below), but this is most likely not acceptable for production code; since the framework is private, you'd have to reverse engineer its use cases.
cristi:~ diciu$ otool -L /usr/bin/hdiutil
/usr/bin/hdiutil:
/System/Library/PrivateFrameworks/DiskImages.framework/Versions/A/DiskImages (compatibility version 1.0.8, current version 194.0.0)
[..]
cristi:~ diciu$ nm /System/Library/PrivateFrameworks/DiskImages.framework/Versions/A/DiskImages | awk -F' ' '{print $3}' | c++filt | grep -i sparse
[..]
CSparseFile::sector2Band(long long)
CSparseFile::addIndexNode()
CSparseFile::readIndexNode(long long, SparseFileIndexNode*)
CSparseFile::readHeaderNode(CBackingStore*, SparseFileHeaderNode*, unsigned long)
[... cut for brevity]
Later Edit
You could use hdiutil as an external process and have it create a sparse disk image for you. From the C process you would then create a file inside the (mounted) sparse disk image.
If you seek (fseek, ftruncate, ...) past the end of the file, the file size will be increased without allocating blocks until you write to the holes. But there's no way to create a magic file that automatically converts blocks of zeroes to holes. You have to do it yourself.
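As a minimal sketch of that technique (the file name sparse.bin is just illustrative; whether the hole actually saves disk blocks depends on the filesystem, as discussed above):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("sparse.bin", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    /* Extend to 1 MiB of logical size; on a filesystem with hole
     * support, the unwritten range occupies no disk blocks. */
    ftruncate(fd, 1024 * 1024);
    close(fd);
    return 0;
}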
This patch may be helpful to look at (the OpenBSD cp command inserts holes instead of writing zeroes).
If you want portability, the last resort is to write your own access functions so that you manage an index and a set of blocks.
In essence you manage a single file the way the OS manages the disk, keeping the chain of blocks that are part of the file, the bitmap of allocated/free blocks, etc.
Of course this will lead to non-optimized, slower access, so I would recommend this approach only if the requirement to save space is absolutely critical and you have enough time to write a robust set of access functions.
And even in that case, I would first investigate whether your problem really needs a different solution. Perhaps you should store your data differently?
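A minimal sketch of what such a scheme could look like; the names (sparse_index, sparse_read) and the table-of-offsets layout are invented for illustration, not an existing API:

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096
#define MAX_BLOCKS 1024

/* Table at the start of the container file: maps a logical block
 * number to its offset in the file; 0 means "hole" (never written). */
struct sparse_index {
    uint64_t block_offset[MAX_BLOCKS];
};

/* Read one logical block: holes cost no storage and read as zeroes. */
static void sparse_read(int fd, const struct sparse_index *idx,
                        unsigned block, char buf[BLOCK_SIZE])
{
    if (idx->block_offset[block] == 0)
        memset(buf, 0, BLOCK_SIZE);   /* unallocated block: all zeroes */
    else
        pread(fd, buf, BLOCK_SIZE, (off_t)idx->block_offset[block]);
}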
It looks like OS X supports sparse files on UDF volumes. I tried titaniumdecoy's test program on OS X 10.9 and it did generate a sparse file on a UDF disk image. Also, note that UFS is no longer supported in OS X, so if you need sparse files, UDF is the only natively supported file system that supports them.
I also tried the program on SMB shares. When the server is Ubuntu (ext4 filesystem) the program creates a sparse file, but 'ls -ls' through SMB doesn't show that. If you do 'ls -ls' on the Ubuntu host itself it does show the file is sparse. When the server is Windows XP (NTFS filesystem) the program does not generate a sparse file.
