Buffer size for getdents64 to finish in one go (C)

On Linux, is it possible to get the buffer size required for getdents64 to fetch all the entries in one go (assuming the directory is not modified after the size is obtained)?
I tried the value from fstat(dirfd, &stb); stb.st_size, but it appears to be unnecessarily large. What value does stat::st_size hold for directories?

As far as I know, no, there is no way to do this, especially not in a manner that works for arbitrary filesystems. The representation of directories is defined by the underlying filesystem, and the same seems to be true of st_size. Certainly there is no way to get the right value for FUSE or 9p or other remote/virtual filesystems.
Why do you want to do this? I don't think it's useful. Once you get beyond a few kB per call, the overhead of the syscalls will be dominated by the actual work done. If you really care you can wrap the getdents64 syscall with a function that keeps resizing the buffer and calling it until EOF is reached. Or you could just use the portable opendir/readdir interfaces which perform appropriate buffering for you.
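A minimal sketch of that wrapper idea (my code, not part of the answer), assuming Linux with a modern glibc whose struct dirent64 matches the kernel's linux_dirent64 layout (otherwise define the struct yourself as in the getdents(2) man page): call getdents64 into a buffer, and if a follow-up call shows entries are left over, rewind the directory, double the buffer, and retry until a single call returns everything.

#define _GNU_SOURCE
#include <dirent.h>        /* struct dirent64 (glibc, with _GNU_SOURCE) */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    int fd = open(".", O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }

    size_t bufsize = 4096;
    char *buf = malloc(bufsize);
    if (!buf) { perror("malloc"); return 1; }

    for (;;) {
        lseek(fd, 0, SEEK_SET);                              /* rewind the directory stream */
        ssize_t n = syscall(SYS_getdents64, fd, buf, bufsize);
        if (n < 0) { perror("getdents64"); return 1; }

        char probe[1024];                                    /* big enough for one entry */
        ssize_t more = syscall(SYS_getdents64, fd, probe, sizeof probe);
        if (more < 0) { perror("getdents64"); return 1; }

        if (more == 0) {
            /* Everything fit in a single call: walk the records. */
            for (ssize_t off = 0; off < n; ) {
                struct dirent64 *d = (struct dirent64 *)(buf + off);
                puts(d->d_name);
                off += d->d_reclen;
            }
            break;
        }
        bufsize *= 2;                                        /* too small: double and retry */
        buf = realloc(buf, bufsize);
        if (!buf) { perror("realloc"); return 1; }
    }
    free(buf);
    close(fd);
    return 0;
}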

Related

C: Reading large files with limited memory

I am working on something that requires reading from and writing to a large file (or equivalent) but is allowed fairly minimal memory to do it (I don't have the exact spec, but let's call the "large" 15 GB and the "minimal" 16K). The file is accessed randomly, usually in chunks of 512 bytes, and it is guaranteed that consecutive reads will sometimes be a significant distance apart - possibly literally opposite ends of the disk (or a small number of MB from either end). Currently I'm using pread/pwrite to hit the locations I want in the file (I was previously using fseek, but abandoned it in favor of pread/pwrite, because reasons).
Accessing the file this way is (perhaps obviously) slow, and I'm looking for ways to optimise/speed up the performance as much as possible (with as little use of external libraries as possible - read: none).
I don't mean to be too cagey about exactly what we're doing, so it might help to think of it as a driver for a file system. At one end of the disk we're accessing the file and directory tables, and at the other the raw data - so we need to write file information and then skip to the data. But even within such zones, don't assume anything about the layout. There is no guarantee that multiple files (or even multiple chunks of a single file) will be stored contiguously - or even close together. This also means that we can't make assumptions about the order in which data will be read.
A couple of things I have considered include:
Opening Multiple File Descriptors for different parts of the file (but I'm not sure there's any state associated with the FD and whether this would even have an impact)
A few smarts around caching data that I expect to be accessed several times in a short amount of time
I was wondering whether others might have been in a similar boat and/or have opinions (or articles they can link) that discuss different strategies to minimise the impact of reading.
I guess I was always wondering whether pread is even the right choice in this situation....
Any thoughts/opinions/criticisms/etc more than welcome.
NOTE: The program will always run in a single thread (so options don't need to be thread-safe, but equally pushing the read to the background isn't an option either).
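For reference, the access pattern being described boils down to something like this (my sketch, not the asker's code; the chunk helpers and file name are made up, and on 32-bit systems you would build with -D_FILE_OFFSET_BITS=64 so the offsets fit in off_t):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

enum { CHUNK = 512 };   /* the 512-byte unit mentioned above */

/* Read one chunk at an arbitrary offset without moving the file position. */
static int read_chunk(int fd, uint64_t chunk_no, unsigned char buf[CHUNK])
{
    return pread(fd, buf, CHUNK, (off_t)(chunk_no * (uint64_t)CHUNK)) == CHUNK ? 0 : -1;
}

static int write_chunk(int fd, uint64_t chunk_no, const unsigned char buf[CHUNK])
{
    return pwrite(fd, buf, CHUNK, (off_t)(chunk_no * (uint64_t)CHUNK)) == CHUNK ? 0 : -1;
}

int main(void)
{
    int fd = open("disk.img", O_RDWR);          /* hypothetical 15 GB backing file */
    if (fd < 0) { perror("open"); return 1; }

    unsigned char buf[CHUNK];
    if (read_chunk(fd, 0, buf) == 0)            /* a chunk near the start... */
        write_chunk(fd, 29000000ULL, buf);      /* ...then one roughly 14 GB away */

    close(fd);
    return 0;
}

If the pattern really is this random, posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) may be worth trying to discourage read-ahead the program will never use, though whether it helps in this particular case would need measuring.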

Using sysctl(3) to write safe, portable code: good idea?

When writing safe code in straight C, I'm sick and tired of coming up with arbitrary numbers to represent limitations -- specifically, the maximum amount of memory to allocate for a single line of text. I know I can always say stuff like
#define MAX_LINE_LENGTH 1024
and then pass that macro to functions such as snprintf().
I work and code in NetBSD, which has a sysctl(3) variable called "user.line_max" designed for this very purpose. So I don't need to come up with an arbitrary number like MAX_LINE_LENGTH, above. I just read the "user.line_max" sysctl variable, which, by the way, is settable by the user.
My question is whether this is the Right Thing in terms of safety and portability. Perhaps different operating systems have a different name for this sysctl, but I'm more interested in whether I should be using this technique at all.
And for the record, "portability" excludes Microsoft Windows in this case.
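For context, the lookup being described is roughly the following (my sketch, not the asker's code; it assumes NetBSD's sysctlbyname(3) and that "user.line_max" is an integer node, which it is on NetBSD):

#include <stdio.h>
#include <stdlib.h>
#include <sys/param.h>
#include <sys/sysctl.h>

int main(void)
{
    int line_max;
    size_t len = sizeof line_max;

    /* Ask the kernel for the user-settable line-length limit. */
    if (sysctlbyname("user.line_max", &line_max, &len, NULL, 0) == -1) {
        perror("sysctlbyname");
        return EXIT_FAILURE;
    }
    printf("user.line_max = %d\n", line_max);

    char *line = malloc((size_t)line_max);   /* buffer sized from the sysctl value */
    if (!line)
        return EXIT_FAILURE;
    /* ... use the buffer ... */
    free(line);
    return EXIT_SUCCESS;
}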
Well, the Linux sysctl(2) man page has this to say in its NOTES section:
Glibc does not provide a wrapper for this system call; call it using syscall(2).
Or rather... don't call it: use of this system call has long been discouraged, and it is so unloved that it is likely to disappear in a future kernel version. Remove it from your programs now; use the /proc/sys interface instead.
So that is one consideration.
Not a good idea. Even if it weren't for what Duck told you, relying on a system-wide setting that's runtime-variable is bad design and error-prone. If you're going to go to the trouble of having buffer size limits be variable (which typically requires dynamic allocation and checking for failure) then you should go the last step and make it configurable on a more local scope.
With your example of buffer size limits, opinions differ as to what's the best practice. Some people think you should always use dynamically-growing buffers with no hard limit. Others prefer fixed limits sufficiently large that reasonable data would not exceed them. Or, as you've noted, configurable limits are an option. In choosing what's right for your application, I would consider the user experience implications. Sure users don't like arbitrary limits, but they also don't like it when accidentally (or by somebody else's malice) reading data with no newlines in it causes your application to consume unbounded amounts of memory, start swapping, and/or eventually crash or bog down the whole system.
The nearest portable construct for this is "getconf LINE_MAX" or the equivalent in C.
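In C that would look something like this (a sketch; the fallback logic and function name are my own choices):

#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Portable line-length limit: prefer the runtime value from sysconf(3),
 * fall back to the compile-time LINE_MAX, then to the POSIX minimum. */
static long portable_line_max(void)
{
    long v = sysconf(_SC_LINE_MAX);
    if (v > 0)
        return v;
#ifdef LINE_MAX
    return LINE_MAX;
#else
    return _POSIX2_LINE_MAX;    /* 2048, the minimum POSIX allows */
#endif
}

int main(void)
{
    printf("LINE_MAX = %ld\n", portable_line_max());
    return 0;
}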
1) Check out the Single Unix Specification, keyword: "limits"
2) s/safety/security/

Random reading on very large files with fgets seems to bring Windows caching to its limits

I have written a C/C++ program for Windows 7 (64-bit) that works on very large files. In the final step it reads lines from an input file (10 GB+) and writes them to an output file. Access to the input file is random; the writing is sequential.
EDIT: The main reason for this approach is to reduce RAM usage.
What I basically do in the reading part is this: (Sorry, very shortened and maybe buggy)
void seekAndGetLine(char* line, size_t lineSize, off64_t pos, FILE* filePointer){
    fseeko64(filePointer, pos, SEEK_SET);       /* SEEK_SET rather than C++'s ios_base::beg */
    fgets(line, (int)lineSize, filePointer);    /* fgets takes an int size */
}
Normally this code is fine, not to say fast, but under some very special conditions it gets very slow. The behaviour doesn't seem to be deterministic, since the performance drops occur on different machines, at different parts of the file, or sometimes don't occur at all. It even goes so far that the program totally stops reading, while there are no disk operations.
Another symptom seems to be the RAM used. My process keeps its own RAM usage steady, but the RAM used by the system sometimes grows very large. After using some RAM tools I found out that the Windows "Mapped File" grows to several GBs. This behaviour also seems to depend on the hardware, since it occurs on different machines at different parts of the process.
As far as I can tell, this problem doesn't exist on SSDs, so it definitely has something to do with the response time of the HDD.
My guess is that the Windows caching somehow gets "weird". The program is fast as long as the cache does its work. But when caching goes wrong, the behaviour turns into either "stop reading" or "grow cache size", and sometimes even both. Since I'm no expert on the Windows caching algorithms, I would be happy to hear an explanation. Also, is there any way, from C/C++, to manipulate/stop/enforce the Windows caching?
Since I've been hunting this problem for a while now, I've already tried some tricks that didn't work out:
filePointer = fopen(fileName, "rbR"); //Just fills the cache till the RAM is full
massive buffering of the reads/writes, to stop the two getting in each other's way
Thanks in advance
Truly random access across a huge file is the worst possible case for any cache algorithm. It may be best to turn off as much caching as possible.
There are multiple levels of caching:
the CRT library (since you're using the f- functions)
the OS and filesystem
probably onboard the drive itself
If you replace your I/O calls to the f- functions in the CRT with the comparable ones in the Windows API (e.g., CreateFile, ReadFile, etc.), you can eliminate the CRT caching, which may be doing more harm than good. You can also warn the OS that you're going to be doing random accesses, which affects its caching strategy. See options like FILE_FLAG_RANDOM_ACCESS and possibly FILE_FLAG_NO_BUFFERING.
You'll need to experiment and measure.
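As an illustration of that suggestion (a sketch only, with a made-up file name and offset; FILE_FLAG_NO_BUFFERING additionally requires sector-aligned offsets, sizes, and buffers, so it is left commented out here):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileA("input.txt", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_RANDOM_ACCESS /* | FILE_FLAG_NO_BUFFERING */,
                           NULL);
    if (h == INVALID_HANDLE_VALUE) { fprintf(stderr, "CreateFile failed\n"); return 1; }

    char buf[4096];
    LARGE_INTEGER pos;
    pos.QuadPart = 5LL * 1024 * 1024 * 1024;        /* arbitrary example offset: 5 GB in */
    SetFilePointerEx(h, pos, NULL, FILE_BEGIN);

    DWORD got = 0;
    if (ReadFile(h, buf, sizeof buf, &got, NULL))   /* no CRT buffering in between */
        printf("read %lu bytes\n", (unsigned long)got);

    CloseHandle(h);
    return 0;
}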
You might also have to reconsider how your algorithm works. Are the seeks truly random? Can you re-sequence them, perhaps in batches, so that they're in order? Can you limit access to a relatively small region of the file at a time? Can you break the huge file into smaller files and then work with one piece at a time? Have you checked the level of fragmentation on the drive and on the particular file?
Depending on the larger picture of what your application does, you could possibly take a different approach - maybe something like this:
decide which lines you need from the input file and store the line numbers in a list
sort the list of line numbers
read through the input file once, in order, and pull out the lines you need (better yet, seek to the next line and grab it, especially when there are big gaps)
if the list of lines you're grabbing is small enough, you can store them in memory for reordering before output; otherwise, stick them in a smaller temporary file and use that file as input for your current algorithm to reorder the lines for final output
It's definitely a more complex approach, but it would be much kinder to your caching subsystem, and as a result, could potentially perform significantly better.
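A rough sketch of the single sequential pass described above (mine, not the answerer's; it assumes each line fits in a fixed buffer and that the wanted line numbers are 0-based and already sorted ascending):

#include <stdio.h>

enum { LINE_BUF = 4096 };

/* One pass over the input: emit only the wanted lines, in file order.
 * Reordering to the originally requested order happens later, as above. */
static void extract_lines(FILE *in, FILE *out,
                          const long long *wanted, size_t nwanted)
{
    char line[LINE_BUF];
    long long lineno = 0;
    size_t i = 0;

    while (i < nwanted && fgets(line, sizeof line, in) != NULL) {
        if (lineno == wanted[i]) {
            fputs(line, out);
            i++;
        }
        lineno++;
    }
}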

How to read a large file in UNIX/LINUX with limited RAM?

I have a large text file to be opened (e.g. 5 GB in size), but only limited RAM (say 1 GB). How can I open and read the file without any memory error? I am running on a Linux terminal with the basic packages installed.
This was an interview question, hence please do not look into the practicality.
I do not know whether to look at it at the system level or the programmatic level... It would be great if someone could throw some light on this issue.
Thanks.
Read it character by character... or X bytes by X bytes... it really depends what you want to do with it... As long as you don't need the whole file at once, that works.
(Ellipses are awesome)
What do they want you to do with the file? Are you looking for something? Extracting something? Sorting? This will affect your approach.
It may be sufficient to read the file line by line or character by character if you're looking for something. If you need to jump around the file or analyze sections of it, then you most likely want to memory-map it. Look up mmap(). Here's a short article on the subject: memory mapped i/o
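A short sketch of the mmap() route (my code; the file name and offset are placeholders, and it assumes a 64-bit POSIX system so the whole 5 GB mapping fits in the address space):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file; pages are faulted in lazily, so only the parts
     * you actually touch consume RAM. */
    const char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    size_t offset = (size_t)1 << 30;                 /* look 1 GB into the file */
    if ((off_t)offset < st.st_size)
        printf("byte at 1 GB: 0x%02x\n", (unsigned char)p[offset]);

    munmap((void *)p, (size_t)st.st_size);
    close(fd);
    return 0;
}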
[just comment]
If you are going to use system calls (open() and read()), then reading character by character will generate a lot of system calls that severely slow down your application. Even with the buffer cache (disk cache) in place, system calls are expensive.
It is better to read block by block, where the block size "SHOULD" be at least 1 MB. With a 1 MB block size, you will issue 5*1024 = 5120 system calls for the 5 GB file.
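That loop would look roughly like this (my sketch; the file name is a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

enum { BLOCK = 1024 * 1024 };   /* 1 MB per read() call */

int main(void)
{
    int fd = open("big.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(BLOCK);
    if (!buf) { perror("malloc"); return 1; }

    ssize_t n;
    unsigned long long total = 0, calls = 0;

    while ((n = read(fd, buf, BLOCK)) > 0) {
        /* process buf[0..n-1] here */
        total += (unsigned long long)n;
        calls++;
    }
    printf("%llu bytes in %llu read() calls\n", total, calls);

    free(buf);
    close(fd);
    return 0;
}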

Does fread fail for large files?

I have to analyze a 16 GB file. I am reading through the file sequentially using fread() and fseek(). Is it feasible? Will fread() work for such a large file?
You don't mention a language, so I'm going to assume C.
I don't see any problems with fread, but fseek and ftell may have issues.
Those functions use long int as the data type to hold the file position, rather than something intelligent like fpos_t or even size_t. This means that they can fail to work on a file over 2 GB, and can certainly fail on a 16 GB file.
You need to see how big long int is on your platform. If it's 64 bits, you're fine. If it's 32, you are likely to have problems when using ftell to measure distance from the start of the file.
Consider using fgetpos and fsetpos instead.
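A tiny sketch of that suggestion (not the answerer's code; the file name is made up): fpos_t can hold positions a 32-bit long cannot, as long as you only save and restore positions rather than computing byte offsets yourself.

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("big.bin", "rb");   /* hypothetical 16 GB file */
    if (!f) { perror("fopen"); return 1; }

    fpos_t mark;
    fgetpos(f, &mark);                  /* remember the current position */

    char buf[4096];
    fread(buf, 1, sizeof buf, f);       /* read ahead a little... */

    fsetpos(f, &mark);                  /* ...and jump back, even deep into the file */

    fclose(f);
    return 0;
}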
Thanks for the response. I figured out where I was going wrong. fseek() and ftell() do not work for files larger than 4GB. I used _fseeki64() and _ftelli64() and it is working fine now.
If implemented correctly this shouldn't be a problem. I assume by sequentially you mean you're looking at the file in discrete chunks and advancing your file pointer.
Check out http://www.computing.net/answers/programming/using-fread-with-a-large-file-/10254.html
It sounds like he was doing nearly the same thing as you.
It depends on what you want to do. If you want to read the whole 16GB of data in memory, then chances are that you'll run out of memory or application heap space.
Rather read the data chunk by chunk and do processing on those chunks (and free resources when done).
But, besides all this, decide which approach you want to take (fread(), istream, etc.) and run some test cases to see which works better for you.
If you're on a POSIX-ish system, you'll need to make sure you've built your program with 64-bit file offset support. POSIX mandates (or at least allows, and most systems enforce this) that the implementation deny I/O operations on files whose size doesn't fit in off_t, even if the only I/O being performed is sequential with no seeking.
On Linux, this means you need to use -D_FILE_OFFSET_BITS=64 on the gcc command line.
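For example (a sketch; the file name and offset are arbitrary), when built with gcc -D_FILE_OFFSET_BITS=64 prog.c, off_t is 64 bits and fseeko/ftello can address the whole 16 GB file:

#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    FILE *f = fopen("huge.bin", "rb");
    if (!f) { perror("fopen"); return 1; }

    if (fseeko(f, (off_t)10 * 1024 * 1024 * 1024, SEEK_SET) != 0)   /* seek to 10 GB */
        perror("fseeko");
    printf("position: %lld\n", (long long)ftello(f));

    fclose(f);
    return 0;
}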
