Accessing block device data beyond reported capacity

I have a SATA block device that reports a capacity smaller than its accessible space, and I would like to read and write past the reported capacity through the file that Linux creates for block devices. So I hope to operate using the descriptor returned from open("/dev/sda", O_RDWR). However, when I try to use lseek to seek past the capacity of the device, I get an error and errno is set to EINVAL (22).
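Stripped down, the failing sequence looks like this (the extra 4096 is just illustrative; any offset past the reported capacity fails the same way):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/sda", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* SEEK_END yields the capacity the device reports. */
        off_t cap = lseek(fd, 0, SEEK_END);

        /* Seeking anywhere past that fails immediately. */
        if (lseek(fd, cap + 4096, SEEK_SET) == (off_t)-1)
            perror("lseek");   /* errno == EINVAL (22) */

        close(fd);
        return 0;
    }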
Is there a way to access the data past the capacity of the device without modifying the device drivers and while still using the file descriptor returned by open()?
My Linux release is CentOS 7 with kernel 3.10.0-514.21.1.el7.x86_64, although I'd be interested in solutions even if they involve other Linux distributions.
Edit: The drive I am working with is a FLEX protocol drive that reports the conventional capacity of the drive, but also has shingled magnetic recording available at an offset above the reported capacity of the drive. If you are interested, the details of this protocol can be found on the T13 website.

If I remember correctly, that error occurs because the device itself was unable to read or write at that location, which suggests the location does not exist. Note that many manufacturers count 1 KB as 1000 bytes and the like, and that file systems reserve space of their own as well.
The short answer is: you don't. The device reports only the space you can use, and does not report cache sizes either. This underreporting happens at the device level, not the OS level.

Related

Should I use block device over char device for reading and writing to memory?

I just started working at a new company and I'm new to the embedded world.
They gave me a task; I have done it and it's working, but I don't know if I did it the right way.
I will describe the task and what I have done.
I was asked to hide a small piece of the DDR from the Linux OS, so that a HW feature can write something into this reserved memory. After that, I need to be able to read this piece of memory out to a file.
To hide a chunk of the DDR from Linux, I simply changed the kernel memory boot argument (mem=) to equal the real memory size - (the size I needed + a small margin for safety). I got this idea, and the idea for the driver I will describe in a second, from this post.
After that, Linux sees less memory than the HW actually has, and the top section of the DDR is hidden from the kernel, so I can use it for my storage without worry.
I think I did this part right, which is more than I can say about the next part.
For the next part, to be able to read the piece of DDR I reserved, I wrote a char device driver. It works: it reads the saved DDR chunk out to a file piece by piece, where each piece is no larger than some value I chose. I can't do it in one copy, because that would require allocating a big buffer and I don't have enough RAM for that.
Now I have read about block devices and started to think that maybe a block device fits my program better, but I'm not really sure, because first, it's working, and if it's not broken... Second, I have never written a block device driver, and I had never written a char device driver until the one described above, so I'm not sure whether this is the time to use a block device over a char device.
This depends on the intended use, but according to your description a character device is much more likely to be what you want. The difference:
a character device takes simple read and write commands and gets no help from the kernel. This is suitable for reading from or writing to devices (and anything that resembles a device), whether it is an actual stream that is read sequentially, or something that supports 'seek' and can read the same data over and over again.
a block device hooks into the kernel's memory paging system and is capable of serving as a back-end for virtual memory pages. It can host a swap space, be the storage for a file system, etc. It is a much more complex beast than a character device. You need this only for something that stores a large amount of data that needs to be accessed by mapping it into the address space of a process (normally this is needed only if you put a file system on it).
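For the use case described in the question, the char driver really can stay small. A rough sketch of such a driver (illustrative only: RESERVED_BASE and RESERVED_SIZE are placeholders for the physical range hidden with the mem= boot argument, and this is not tied to any particular kernel version):

    #include <linux/fs.h>
    #include <linux/io.h>
    #include <linux/miscdevice.h>
    #include <linux/module.h>
    #include <linux/uaccess.h>

    /* Placeholders: the physical range hidden from Linux with mem= */
    #define RESERVED_BASE 0x3f000000UL
    #define RESERVED_SIZE 0x00100000UL

    static void __iomem *reserved_va;

    static ssize_t reserved_read(struct file *file, char __user *buf,
                                 size_t count, loff_t *ppos)
    {
        if (*ppos >= RESERVED_SIZE)
            return 0;                          /* EOF */
        if (count > RESERVED_SIZE - *ppos)
            count = RESERVED_SIZE - *ppos;     /* clamp to the region */

        /* Userspace chooses the chunk size, so no large kernel-side
         * buffer is ever needed. (A production driver would prefer
         * memcpy_fromio() or a cached mapping for real RAM.) */
        if (copy_to_user(buf, (const void __force *)reserved_va + *ppos,
                         count))
            return -EFAULT;

        *ppos += count;
        return count;
    }

    static const struct file_operations reserved_fops = {
        .owner  = THIS_MODULE,
        .read   = reserved_read,
        .llseek = default_llseek,
    };

    static struct miscdevice reserved_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "ddr_reserved",
        .fops  = &reserved_fops,
    };

    static int __init reserved_init(void)
    {
        reserved_va = ioremap(RESERVED_BASE, RESERVED_SIZE);
        if (!reserved_va)
            return -ENOMEM;
        return misc_register(&reserved_dev);
    }

    static void __exit reserved_exit(void)
    {
        misc_deregister(&reserved_dev);
        iounmap(reserved_va);
    }

    module_init(reserved_init);
    module_exit(reserved_exit);
    MODULE_LICENSE("GPL");

Userspace then reads /dev/ddr_reserved in whatever chunk size it likes, which is essentially what your existing driver already does.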

In Linux, how do I determine optimal value of optmem_max?

We have a Linux project where we are pushing struct information over buffers. Recently, we found that the kernel parameter optmem_max was too small. I was asked to increase this by a supervisor. While I understand how to do this, I don't really understand how I know how big to make this.
Further, I don't really get what optmem_max is.
Here's what the kernel documentation says:
"Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence of struct cmsghdr structures with appended data."
(I don't really understand what this means in English).
I see many examples on the Internet suggesting that this should be increased for better performance.
In:
/etc/sysctl.conf
I added this line to fix the problem:
net.core.optmem_max=1020000
Once this was done, we got better performance.
So to summarize my question:
In English, what is optmem_max?
Why is it so low by default in most Linux distros if making it bigger improves performance?
How does one determine what a good size for this number would be?
What are the ramifications of making this really large?
Aside from /etc/sysctl.conf, where is this set in the kernel by default? I grepped the kernel source, but I could find no trace of the default value of optmem_max being set to 20480, which is the default on our system.
In English, what is optmem_max?
optmem_max is a kernel option that affects the memory allocated to the cmsg list maintained by the kernel that contains "extra" packet information like SCM_RIGHTS or IP_TTL.
Increasing this option allows the kernel to allocate more memory as needed for more control messages that need to be sent for each socket connected (including IPC sockets/pipes).
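To make the "ancillary data" part concrete, here is roughly what a control message looks like from user space: passing a file descriptor across an AF_UNIX socket with SCM_RIGHTS. The function and variable names are mine:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send one file descriptor (fd_to_pass) over a connected AF_UNIX
     * socket. The cmsghdr built here is exactly the kind of ancillary
     * data whose per-socket total optmem_max caps. */
    int send_fd(int sock, int fd_to_pass)
    {
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {                                /* keeps the buffer aligned */
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }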
Why is it so low by default in most Linux distros if making it bigger improves performance?
Most distributions have normal users in mind, and most normal users, even those using Linux/Unix as a server, do not have a farm of fiber-channel-connected servers or server processes pushing gigabytes of data over IPC.
A 20 KB buffer is large enough for most uses, minimizes the kernel memory required by default, and is easy enough to reconfigure for anyone who needs more.
How does one determine what a good size for this number would be?
Depends on your usage, but the Arch Wiki suggests a 64KB size for optmem_max and a 16MB size for rmem_max and wmem_max (the receive and send buffer limits, respectively).
What are the ramifications of making this really large?
More kernel memory can be allocated per connected socket, possibly unnecessarily.
Aside from /etc/sysctl.conf, where is this set in the kernel by default? I grepped the kernel source, but I could find no trace of the default value of optmem_max being set to 20480, which is the default on our system.
I'm not a Linux kernel source aficionado, but it looks like it could be in net/core/sock.c:318.
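For what it's worth, the default appears to be computed rather than written out as a literal, which would explain why grepping for 20480 turns up nothing. In the kernel sources I've looked at, the line is approximately:

    /* net/core/sock.c (approximate; exact location varies by version) */
    __u32 sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
    /* On a 64-bit system: 8 * (2*1024 + 512) = 20480 */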
Hope that can help.

Increasing Linux DMA_ZONE memory on ARM i.MX287

I am working in an Embedded Linux system which has the 2.6.35.3 kernel.
Within the device we require a 4MB+192kB contiguous DMA capable buffer for one of our data capture drivers. The driver uses SPI transfers to copy data into this buffer.
The user space application issues a mmap system call to map the buffer into user space and after that, it directly reads the available data.
The buffer is allocated using the alloc_bootmem_low_pages call, because it is not possible to allocate a buffer larger than 4 MB with other methods, such as kmalloc.
However, due to a recent upgrade, we need to increase the buffer size to 22MB+192kB. From what I've read, the Linux kernel's DMA zone is only 16 MB, so in theory this is not possible unless there is a way to tweak that setting.
If anyone knows how to do this, please let me know.
Is this a good idea or will this make the system unstable?
The ZONE_DMA 16MB limit is imposed by a hardware limitation of certain devices. Specifically, on the PC architecture in the olden days, ISA cards performing DMA needed buffers allocated in the first 16MB of the physical address space because the ISA interface had 24 physical address lines which were only capable of addressing the first 2^24=16MB of physical memory. Therefore, device drivers for these cards would allocate DMA buffers in the ZONE_DMA area to accommodate this hardware limitation.
Depending on your embedded system and device hardware, your device either is or isn't subject to this limitation. If it is subject to this limitation, there is no software fix you can apply to allow your device to address a 22MB block of memory, and if you modify the kernel to extend the DMA address space beyond 16MB, then of course the system will become unstable.
On the other hand, if your device is not subject to this limitation (which is the only way it could possibly write to a 22MB buffer), then there is no reason to allocate memory in ZONE_DMA. In this case, I think if you simply replace your alloc_bootmem_low_pages call with an alloc_bootmem_pages call, it should work fine to allocate your 22MB buffer. If the system becomes unstable, then it's probably because your device is subject to a hardware limitation, and you cannot use a 22MB buffer.
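Concretely, the change would look something like this in the board setup code, using the 2.6-era bootmem API (buffer size taken from the question; the function name and placement are a sketch, not your actual driver):

    #include <linux/bootmem.h>
    #include <linux/init.h>

    /* 22 MB + 192 kB capture buffer (sizes from the question) */
    #define CAPTURE_BUF_SIZE ((22 * 1024 + 192) * 1024UL)

    static void *capture_buf;

    /* Called from early board setup, while the bootmem allocator is
     * still active. alloc_bootmem_pages() returns page-aligned memory
     * without the low/DMA-region restriction that
     * alloc_bootmem_low_pages() carries. */
    void __init capture_buf_reserve(void)
    {
        capture_buf = alloc_bootmem_pages(CAPTURE_BUF_SIZE);
    }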
It looks like my first attempt at an answer was a little too generic. I think that for the specific i.MX287 architecture you mention in the comments, the DMA zone size is configurable through the CONFIG_DMA_ZONE_SIZE parameter, which can be made as large as 32 MB. The relevant configuration option should be under "System Type -> Freescale i.MXS implementations -> DMA memory zone size".
On this architecture, it seems safe to modify it, as it looks like it's not addressing a hardware limitation (the way it was on x86 architectures) but just determining how to lay out memory.
If you try setting it to 32 MB and testing both alloc_bootmem_pages and alloc_bootmem_low_pages in your own driver, perhaps one of those will work.
Otherwise, I think I'm out of ideas.

NAND RAW access

I'm working on a C++ application in an embedded system running Linux. The device receives messages (small chunks of a few bytes) that need to be stored in non-volatile memory in case of power failure. This worked well on another platform because static RAM was available.
The problem on this platform is that we only have a NAND flash for this, and we would like to append different messages within the same block without having to erase the whole block before writing a new message. Writing one file per message is not a good solution because there can be a lot of them! Moreover, this must be efficient and should spare the flash's life by avoiding too many erases! What I would like is to write byte after byte into the flash without worrying about bad blocks.
I found "Petit FAT File System" and I'm wondering if it would suit my needs...?
Could someone tell me if this is possible with "Petit FAT File System", or give me any suggestion on how to handle this?
Thanks!
I haven't looked into the Petit file system, but your real limitation is the NAND flash itself. The manufacturer's data sheet will likely indicate how many writes you can successfully make to each block before an erase is required. It's possible that there is no hard limit, but the integrity of the data will not be guaranteed beyond a maximum write count.
The answer depends on the process technology and flash cell design. For example, is it SLC or MLC NAND? SLC is going to be able to handle multiple block writes better.
Another question would be what type of flash controller is on your system? If it uses hardware ECC, then you might be limited by the controller, since 2nd writes will invalidate the ECC value of the 1st data write. If it is possible that you can do ECC calculations in software, then it comes back to the NAND limitation.
Small write support might be addressed in the data sheet, via a special set aside memory area that might be provided. So again, check the data sheet.
If you post a link, or indicate what hardware you are using, I can try and give you a more definite answer.
If you are dealing with flash, there's no way around erasing it before writing; all flash memory works that way. Depending on your real-time requirements and the size of the data, this may or may not be an issue. But since you are using embedded Linux, real-time is probably not a major concern for the application anyhow.
I don't see why you would need a complete file system to store a few bytes?! Why do you need an external memory for this in the first place, can't you write to the internal flash of the MCU? If you just need to store a few bytes, an MCU with on-chip eeprom/data flash would likely suit your needs the best.
Also, that flash chip doesn't look too promising. First, I find it mighty fishy that they don't spell out the number of cycles or the data retention, but instead refer to the "qualification report". That might indicate that the memory is of poor quality.
And the data sheet says year 2009 and Samsung. If I may be cynical, that probably means the chip is already obsolete. Samsung doesn't exactly have the best long-life reputation.
I'm curious why you want to use raw flash. Why not use something like JFFS2 or UBIFS on top of the MTD device? Let the MTD driver manage the ECC while JFFS2 or UBIFS manages the wear-leveling. Then just open one file and write to it whenever you need to.
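With JFFS2 or UBIFS mounted on the MTD partition, storing a message becomes ordinary file I/O; the filesystem and MTD layer handle erase blocks, bad blocks and wear-leveling for you. A minimal sketch (the mount point and file name are placeholders):

    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Append one message to a log file on the flash filesystem and
     * make sure it reaches the media before reporting success. */
    int log_message(const void *msg, size_t len)
    {
        int fd = open("/mnt/flash/messages.log",
                      O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, msg, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }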

Working with block special files/devices to implement a filesystem

I've implemented a basic filesystem using FUSE, with all foreseeable POSIX functionality implemented [naturally I haven't even profiled yet ;)]. Currently I'm able to run the filesystem on a regular file (st_mode & S_IFREG), but the next step in development is to host it on an actual block device. Running my code as-is, it immediately fails on reading st_size after calling fstat on the device. Of course I don't expect the problems to stop there, so:
What changes are required to operate on block devices as opposed to regular files?
What are some special considerations I need to make with regard to performance, limitations, special features and the like?
Are there any tutorials or references for dealing with block special files? Googling has turned up very little of use; I only have background knowledge (ironically from MSDN in my dark past) and some scanty information in the man pages.
Update0
I've pointed out what I mean by "regular file".
I don't want to concentrate on getting the device size, I want general guidelines for differences between regular files and device files with respect to performance and usage.
"Currently I'm able to run the filesystem on a regular file, but the next step in development is to host it on an actual block device"
I don't completely understand what you mean - I assume you are saying that "you currently save your filesystem data to a plain file on a normally mounted filesystem - but now wish to use a raw block device for your data storage".
If so - having done this a few times - I'd advise the following:
Never use an "actual" block device for your filesystem. Always use a partition. There are several rarely-used partition types you can use to denote that a partition may hold your filesystem type, so that your filesystem can check it and mount it if it does. Thus, you will never be running on something like /dev/sdb; rather, you will store your data on something like /dev/sdb1, and assign it some partition type. This has obvious advantages, like allowing your filesystem to be co-resident on a single physical disk with another, etc.
If you are implementing any caching in your filesystem (like Linux does with the page cache), do all I/O to the block device with O_DIRECT. This requires you to pass page-aligned memory for all I/O, and requires the requests to be sector/block aligned - but it removes a data copy that would otherwise be required when data moves from the block device to the page cache, and then from the page cache to your user-space [filesystem] reader.
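A minimal sketch of that pattern (the device path is a placeholder, and 4096 is assumed to cover the device's logical block size):

    #define _GNU_SOURCE   /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLK_SZ 4096

    int main(void)
    {
        void *buf;
        int fd = open("/dev/sdb1", O_RDWR | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, BLK_SZ, BLK_SZ) != 0)
            return 1;

        /* With O_DIRECT the buffer address, file offset and transfer
         * size must all be suitably aligned; block-size alignment is
         * always safe. The read bypasses the page cache entirely. */
        if (pread(fd, buf, BLK_SZ, 0) != BLK_SZ)
            return 1;

        free(buf);
        close(fd);
        return 0;
    }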
What do you mean that the fstat "fails"? Is this an fstat trying to determine the length of the block device? Do you receive an error? What is it?
Block devices behave very much like files; tools like dd can operate on them without any special handling. fstat, though, returns information about the special-file node, not the block device it refers to. You probably want to use the BLKGETSIZE64 ioctl to read the size.
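For example (error handling kept minimal; the device node is passed as the first argument):

    #include <fcntl.h>
    #include <linux/fs.h>      /* BLKGETSIZE64 */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    int main(int argc, char **argv)
    {
        uint64_t bytes = 0;
        int fd = open(argv[1], O_RDONLY);   /* e.g. /dev/sdb1 */

        /* BLKGETSIZE64 reports the device size in bytes, which is
         * what fstat's st_size would give you for a regular file. */
        if (fd < 0 || ioctl(fd, BLKGETSIZE64, &bytes) != 0) {
            perror("BLKGETSIZE64");
            return 1;
        }
        printf("%llu bytes\n", (unsigned long long)bytes);
        return 0;
    }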
There's no particular reason to use a partition over a raw device, though; a blockdev is a blockdev. O_DIRECT is good as well, assuming your workload won't generate repeated accesses. Don't confuse it with a real protocol for ensuring the permanence and atomicity of your filesystem, though (fsync, barriers, etc.).
