C read part of file into cache

I have to do a program (for Linux) where there's an extremely large index file and I have to search and interpret the data from the file. Now the catch is, I'm only allowed to have x-bytes of the file cached at any time (determined by argument) so I have to remove certain data from the cache if it's not what I'm looking for.
If my understanding is correct, fopen("r") doesn't put anything in the cache; data only gets cached when I call getc or fread (specifying a size).
So my question is, let's say I use fread and read 100 bytes, but after checking them, only 20 of the 100 bytes contain the data I need; how would I remove the useless 80 bytes from the cache (or overwrite them) in order to read more from the file?
EDIT: By caching I mean data stored in memory, which makes the problem easier.

fread's first argument is a pointer to a block of memory, so the way to go about this is to point it at the data you want to overwrite. For example, let's say you want to keep bytes 20-40 and overwrite everything else. You could either a) invoke fread on the start of the buffer with a length of 20, then invoke it again at &buffer[40] with a size of 60, or b) start by defragmenting (i.e. copy the bytes you want to keep to the start of the buffer), then invoke fread with a pointer to the position just after them.
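A minimal sketch of option (b), assuming a 100-byte buffer and that the bytes worth keeping currently sit at keep_off inside it (the helper name and parameters are illustrative, not from the question):

#include <stdio.h>
#include <string.h>

#define BUF_SIZE 100

/* Option (b): compact the bytes worth keeping to the front of the buffer,
 * then overwrite the rest with fresh data from the file.
 * keep_off/keep_len describe where the wanted bytes currently sit. */
static size_t refill_keeping(FILE *fp, unsigned char buf[BUF_SIZE],
                             size_t keep_off, size_t keep_len)
{
    memmove(buf, buf + keep_off, keep_len);          /* "defragment" */
    size_t got = fread(buf + keep_len, 1, BUF_SIZE - keep_len, fp);
    return keep_len + got;                           /* valid bytes now in buf */
}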

Why do you want to micromanage the cache? Secondly, what makes you think you can? No argument specified on the command line of your program can control what the cache manager does internally - it may decide to read an entire file into RAM, it may decide to read none of it, or it may decide to throw a party. Any control you have over it would require low-level APIs/syscalls and would not be very granular.

I think you might be confused about the requirements, or maybe the person who gave them to you is. You seem to be referring to the cache managed by the operating system, which an application never needs to worry about; the operating system automatically makes sure it doesn't grow too large.
The other meaning of "cache" is the one you create yourself, the char* buffer or whatever you create to temporarily hold the data in memory while you process it. This one should be fairly easy to manage yourself simply by not allocating too much memory for that buffer.
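For illustration, a minimal sketch of such a self-managed "cache", assuming the byte limit is passed as the first command-line argument and the index file as the second (both assumptions, since the question doesn't show its argument handling):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <cache-bytes> <index-file>\n", argv[0]);
        return 1;
    }

    size_t cache_bytes = (size_t)strtoul(argv[1], NULL, 10);
    unsigned char *cache = malloc(cache_bytes);
    FILE *fp = fopen(argv[2], "rb");
    if (cache == NULL || fp == NULL)
        return 1;

    size_t got;
    while ((got = fread(cache, 1, cache_bytes, fp)) > 0) {
        /* inspect the 'got' bytes here; the next fread simply overwrites
         * them, so at most cache_bytes of file data live in memory */
    }

    fclose(fp);
    free(cache);
    return 0;
}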

To discard the read buffer of a file opened with fopen(), you can use fflush(). Also note that you can control the buffer size with setvbuf().
You should consider using open/read (instead of fopen/fread) if you must have exact control over buffering, though.
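A small sketch of the setvbuf() route (the function name open_with_buffer and the cache_bytes parameter are just illustrative):

#include <stdio.h>

FILE *open_with_buffer(const char *path, size_t cache_bytes)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Must be called before the first read on the stream; passing NULL
     * lets stdio allocate (and later free) the buffer itself. */
    if (setvbuf(fp, NULL, _IOFBF, cache_bytes) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}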

Related

How to check if a file of given length can be created?

I want to create a non-sparse file of a given length (e.g. 2GB), but I want to check that this is possible before actually writing anything to disk.
In other words I want to avoid getting ENOSPC (No space left on device) while writing. I'd prefer not to create a "test file" of size 2GB or things like that just to check that there is enough space left.
Is that possible?
Use posix_fallocate(3).
From the description:
The function posix_fallocate() ensures that disk space is allocated
for the file referred to by the descriptor fd for the bytes in the
range starting at offset and continuing for len bytes. After a
successful call to posix_fallocate(), subsequent writes to bytes in
the specified range are guaranteed not to fail because of lack of
disk space
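A minimal usage sketch (the file name is made up; on 32-bit systems you would also want to build with -D_FILE_OFFSET_BITS=64 so off_t can hold 2GB):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.dat", O_CREAT | O_WRONLY, 0644);   /* name is made up */
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* posix_fallocate returns an error number directly (not -1/errno) */
    int err = posix_fallocate(fd, 0, 2LL * 1024 * 1024 * 1024);
    if (err != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        close(fd);
        return 1;
    }

    /* writes within the first 2GB can no longer fail with ENOSPC */
    close(fd);
    return 0;
}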
You can use the statvfs function to determine how many free bytes (and inodes) a given filesystem has.
That should be enough for a quick check, but do remember that it's not a guarantee that you'll be able to write as much (or, for that matter, that writing more than that would have failed) - other applications could also be writing to (or deleting from) the same filesystem. So do continue to check for various write errors.
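Something along these lines (the helper name and the way the required size is passed are illustrative):

#include <stdio.h>
#include <sys/statvfs.h>

/* Quick check: are there (currently) at least 'need' bytes free on the
 * filesystem containing 'path'? Only a snapshot, not a reservation. */
int enough_space(const char *path, unsigned long long need)
{
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0) {
        perror("statvfs");
        return 0;
    }
    /* f_bavail: blocks available to unprivileged users; f_frsize: fragment size */
    unsigned long long avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;
    return avail >= need;
}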
fallocate or posix_fallocate can be used to allocate (and deallocate) a large chunk, which is probably a better option for your use case. (Check the man page; there are a lot of options for space management that you might find interesting.)

Linux C Standard I/O - why double copying

Assuming I understand the flow correctly, one would like to read a few bytes off an opened FILE stream, let's say using fread:
1. the read syscall copies the data from the kernel into the user-space buffer
2. the user-space buffer (either allocated by glibc or provided via setvbuf...) is copied into the buffer provided to fread
Why is the 2nd step needed? Why can't I get a pointer to the user-space buffer, so I can decide whether I want to store (copy) the data or not?
Thanks,
The purpose of the 2nd buffer is to amortize the system call overhead. If you read/write only a few bytes at a time, this second user space buffer will improve performance enormously. OTOH if you read/write a large chunk, the 2nd buffer can be bypassed, so you don't pay the price for double copying.
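As an illustration of the amortization point (not a benchmark; both helpers are made up), counting bytes one at a time with buffered stdio versus one read(2) per byte:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Buffered version: each getc normally just reads from the stdio buffer,
 * so only one read() syscall happens per buffer-full. */
long count_stdio(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    long n = 0;
    while (getc(fp) != EOF)
        n++;
    fclose(fp);
    return n;
}

/* Unbuffered version: one syscall per byte - much slower for small reads. */
long count_raw(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    long n = 0;
    char c;
    while (read(fd, &c, 1) == 1)
        n++;
    close(fd);
    return n;
}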
The second step is what it is all about. The kernel must take care of such operations, and the API you use is then fed the result. This is the usual kernel-space/user-space behaviour; read about it. You perhaps might not know it now, but the kernel-space/user-space separation is a basic part of OS infrastructure.

reading data from filesystem vs compiling the data directly into program

I have a file (10-20MB) containing data, where each line is a single piece of data.
I have a C program that reads the file from the filesystem and then, based on command-line input, reads each line of the file, does a calculation on each line to determine if that line should be returned, and then returns a subset of the data.
Assume that the program does an fread and reads the entire file into memory at the beginning, and then parses it directly from memory.
Would the program execute faster if, instead of reading it from the filesystem, I compiled the data into the program directly, by creating an array such as the following?
char *dataArray[] = {"data1", "data2", "data3"....};
Since the OS needs to read the entire binary from the filesystem, my gut feeling is that the execution time of both techniques would be similar, since reading from the filesystem would be the high order bit. However, would anyone have more definitive ideas on this?
Defining everything as a program literal will certainly be faster.
You do not need the relatively slow "open" call for the data file and you don't need to move the data from the buffer to your storage.
This was a common optimization circa 1970, and every programming/coding style book since then strongly recommends you do not do this. The actual performance increase is minimal, and what you gain in performance you lose in maintainability and flexibility.
Should you want a quick, maintainable optimisation for this type of problem, then look at the mmap call, which makes the buffer directly available to your program and minimises data movement.
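A minimal sketch of the mmap approach (the data file name comes from argv; error handling kept short):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

    /* Map the whole file read-only and parse it in place - no read/copy
     * into a separate heap buffer. */
    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);                       /* the mapping survives the close */

    /* ... scan data[0 .. st.st_size - 1] line by line here ... */

    munmap((void *)data, st.st_size);
    return 0;
}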
I doubt the difference in execution time will be significant, but from a memory utilization standpoint, putting the data in the executable (and qualifying it const appropriately) will make a big difference.
If you read 10-20 megs of data from a file into memory allocated (e.g. via malloc) in your program, the data initially exists in two places in memory: the filesystem cache, and your program's private memory. The former copy can be discarded if memory is tight, but the latter occupies physical memory or swap permanently until it's freed.
If on the other hand the 10-20 megs of data are part of your program's image (in the executable file), the data will be demand-paged, and can be discarded whenever needed because the OS knows it can reload the pages if it needs them again.

what's the proper buffer size for 'write' function?

I am using the low-level I/O function 'write' to write some data to disk in my code (C language on Linux). First, I accumulate the data in a memory buffer, and then I use 'write' to write the data to disk when the buffer is full. So what's the best buffer size for 'write'? According to my tests, bigger isn't always faster, so I am here looking for the answer.
There is probably some advantage in doing writes which are multiples of the filesystem block size, especially if you are updating a file in place. If you write less than a full block to a file, the OS has to read the old block, combine in the new contents and then write it out. This doesn't necessarily happen if you rapidly write small pieces in sequence, because the updates will be done on buffers in memory which are flushed later. Still, once in a while you could be triggering some inefficiency if you are not filling a block (and a properly aligned one: a multiple of the block size at an offset which is a multiple of the block size) with each write operation.
This issue of transfer size does not necessarily go away with mmap. If you map a file, and then memcpy some data into the map, you are making a page dirty. That page has to be flushed at some later time: it is indeterminate when. If you make another memcpy which touches the same page, that page could be clean now and you're making it dirty again. So it gets written twice. Page-aligned copies of multiples-of a page size will be the way to go.
You'll want it to be a multiple of the CPU page size, in order to use memory as efficiently as possible.
But ideally you want to use mmap instead, so that you never have to deal with buffers yourself.
You could use BUFSIZ, defined in <stdio.h>.
Otherwise, use a small multiple of the page size sysconf(_SC_PAGESIZE) (e.g. twice that value). Most Linux systems have 4 Kbyte pages (which is often the same as, or a small multiple of, the filesystem block size).
As others replied, using the mmap(2) system call could help. GNU systems (e.g. Linux) have an extension: the mode string passed to fopen may contain the letter m, and when it does, the GNU libc tries to mmap.
If you deal with data nearly as large as your RAM (or half of it), you might want to also use madvise(2) to fine-tune performance of mmap.
See also this answer to a question quite similar to yours. (You could use 64Kbytes as a reasonable buffer size).
The "best" size depends a great deal on the underlying file system.
The stat and fstat calls fill in a data structure, struct stat, that includes the following field:
blksize_t st_blksize; /* blocksize for file system I/O */
The OS is responsible for filling this field with a "good size" for write() blocks. However, it's also important to call write() with memory that is "well aligned" (e.g., the result of malloc calls). The easiest way to get this to happen is to use the provided <stdio.h> stream interface (with FILE * objects).
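If you do call write() yourself, a small helper along these lines (the name is made up) picks the filesystem's preferred size, falling back to the page size:

#include <sys/stat.h>
#include <unistd.h>

/* Use the filesystem's preferred I/O size for this descriptor, falling
 * back to the page size if the query fails. */
size_t preferred_block_size(int fd)
{
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_blksize > 0)
        return (size_t)st.st_blksize;
    return (size_t)sysconf(_SC_PAGESIZE);
}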
Using mmap, as in other answers here, can also be very fast for many cases. Note that it's not well suited to some kinds of streams (e.g., sockets and pipes) though.
It depends on the amount of RAM, VM, etc. as well as the amount of data being written. The more general answer is to benchmark what buffer works best for the load you're dealing with, and use what works the best.

Is there a way to pre-emptively avoid a segfault?

Here's the situation:
I'm analysing a programs' interaction with a driver by using an LD_PRELOADed module that hooks the ioctl() system call. The system I'm working with (embedded Linux 2.6.18 kernel) luckily has the length of the data encoded into the 'request' parameter, so I can happily dump the ioctl data with the right length.
However quite a lot of this data has pointers to other structures, and I don't know the length of these (this is what I'm investigating, after all). So I'm scanning the data for pointers, and dumping the data at that position. I'm worried this could leave my code open to segfaults if the pointer is close to a segment boundary (and my early testing seems to show this is the case).
So I was wondering what I can do to pre-emptively check whether the current process owns a particular offset before trying to dereference? Is this even possible?
Edit: Just an update as I forgot to mention something that could be very important, the target system is MIPS based, although I'm also testing my module on my x86 machine.
Open a file descriptor to /dev/null and try write(null_fd, ptr, size). If it returns -1 with errno set to EFAULT, the memory is invalid. If it returns size, the memory is safe to read. There may be a more elegant way to query memory validity/permissions with some POSIX invention, but this is the classic simple way.
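A minimal sketch of that trick (the helper name is illustrative; it treats any short write or error as "not readable"):

#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if [ptr, ptr + size) can be read by this process, 0 otherwise.
 * A write() to /dev/null fails with EFAULT when the source memory is bad. */
int mem_readable(const void *ptr, size_t size)
{
    static int null_fd = -1;
    if (null_fd == -1)
        null_fd = open("/dev/null", O_WRONLY);
    if (null_fd == -1)
        return 0;

    ssize_t n = write(null_fd, ptr, size);
    return n == (ssize_t)size;       /* EFAULT or a short write -> not safe */
}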
If your embedded linux has the /proc/ filesystem mounted, you can parse the /proc/self/maps file and validate the pointer/offsets against that. The maps file contains the memory mappings of the process, see here
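Roughly like this, assuming each maps line starts with "start-end" in hex (the helper name is made up):

#include <stdio.h>

/* Returns 1 if 'addr' falls inside a mapping listed in /proc/self/maps.
 * Each line there starts with "start-end perms offset ...". */
int addr_mapped(unsigned long addr)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return 0;

    char line[512];
    int found = 0;
    unsigned long start, end;
    while (fgets(line, sizeof line, maps) != NULL) {
        if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
            addr >= start && addr < end) {
            found = 1;
            break;
        }
    }
    fclose(maps);
    return found;
}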
I know of no such possibility. But you may be able to achieve something similar. As man 7 signal mentions, SIGSEGV can be caught. Thus, I think you could
1. Start with dereferencing a byte sequence known to be a pointer
2. Access one byte after the other, at some time triggering SIGSEGV
3. In SIGSEGV's handler, mark a variable that is checked in the loop of step 2
4. Quit the loop, this page is done.
There are several problems with that.
Since several buffers may live in the same page, you might output what you think is one buffer but is, in reality, several. You may be able to help with that by also LD_PRELOADing Electric Fence, which would, AFAIK, cause the application to allocate a whole page for every dynamically allocated buffer. So you would not output several buffers thinking they are only one, but you still don't know where a buffer ends and would output much garbage at the end. Also, stack-based buffers can't be helped by this method.
You don't know where the buffers end.
Untested.
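A rough sketch of the same idea, but using sigsetjmp/siglongjmp instead of a checked flag, since simply returning from a SIGSEGV handler would re-execute the faulting access (equally untested against the original scenario):

#include <setjmp.h>
#include <signal.h>

static sigjmp_buf probe_env;

static void segv_handler(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);        /* jump back out of the faulting access */
}

/* Probe one byte: returns 1 if *p can be read, 0 if it raises SIGSEGV. */
int byte_readable(const volatile char *p)
{
    struct sigaction sa, old;
    sa.sa_handler = segv_handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    int ok;
    if (sigsetjmp(probe_env, 1) == 0) {
        (void)*p;                    /* may fault */
        ok = 1;
    } else {
        ok = 0;                      /* arrived here via siglongjmp */
    }

    sigaction(SIGSEGV, &old, NULL);
    return ok;
}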
Can't you just check for the segment boundaries? (I'm guessing by segment boundaries you mean page boundaries?)
If so, page boundaries are well delimited (either 4K or 8K) so simple masking of the address should deal with it.
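For instance, masking the address gives the number of bytes left before the next page boundary, so a dump can stop (or re-probe) there:

#include <stdint.h>
#include <unistd.h>

/* Number of bytes from p to the next page boundary: mask the address with
 * (page size - 1), assuming the page size is a power of two. */
size_t bytes_left_in_page(const void *p)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    uintptr_t addr = (uintptr_t)p;
    return page - (addr & (page - 1));
}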
