C malloc/free + fgets performance - c

As I loop through lines in file A, I am parsing the line and putting each string (char*) into a char**.
At the end of a line, I then run a procedure that consists of opening file B, using fgets, fseek and fgetc to grab characters from that file. I then close file B.
I repeat reopening and reclosing file B for each line.
What I would like to know is:
Is there a significant performance hit from using malloc and free, such that I should use something static like myArray[NUM_STRINGS][MAX_STRING_WIDTH] instead of a dynamic char** myArray?
Is there significant performance overhead from opening and closing file B (conceptually, many thousands of times)? If my file A is sorted, is there a way for me to use fseek to move "backwards" in file B, to reset where I was previously located in file B?
EDIT Turns out that a two-fold approach greatly reduced the runtime:
My file B is actually one of twenty-four files. Instead of opening up the same file B1 a thousand times, and then B2 a thousand times, etc. I open up file B1 once, close it, B2 once, close it, etc. This reduces many thousands of fopen and fclose operations to roughly 24.
I used rewind() to reset the file pointer.
This yielded a roughly 60-fold speed improvement, which is more than sufficient. Thanks for pointing me to rewind().

If your dynamic array grows in time, there is a copy cost on some reallocs. If you use the "always double" heuristic, this is amortized to O(n), so it is not horrible. If you know the size ahead of time, a stack allocated array will still be faster.
For the second question read about rewind. It has got to be faster than opening and closing all the time, and lets you do less resource management.

What I would like to know is:
does your code work correctly?
is it running fast enough for your purpose?
If the answer both of these is "yes", don't change anything.

Opening and closing has a variable overhead depending on if other programs are competitng for that resource.
measure the file size first and then use that to calculate the array size in advance to do one big heap allocation.
You won't get a multi-dimensional array right off, but a bit of pointer arithmetic and you are there.
Can you not cache positional information in the other file and then, rather than opening and closing it, use previous seek indexes as an offset? Depends on the exact logic really.

If your files are large, disk I/O will be far more expensive than memory management. Worrying about malloc/free performance before profiling indicates that it is a bottleneck is premature optimization.
It is possible that the overhead from frequent open/close is significant in your program, but again the actual I/O is likely to be more expensive, unless the files are small, in which case the loss of buffers between close and open can potentially cause extra disk I/O. And yes you can use ftell() to get the current position in a file then fseek with SEEK_SET to get to that.

There is always a performance hit with using dynamic memory. Using a static buffer will provide a speed boost.
There is also going to be a performance hit with reopening a file. You can use fseek(pos, SEEK_SET) to set the file pointer to any position in the file or fseek(offset, SEEK_CUR) to do a relative move.
Significant performance hit is relative, and you will have to determine what that means for yourself.

I think it's better to allocate the
actual space you need, and the
overhead will probably not be
significant. This avoids both
wasting space and stack overflows
Yes. Though the IO is cached,
you're making unnecessary syscalls
(open and close). Use fseek with
probably SEEK_CUR or SEEK_SET.

In both cases, there is some performance hit, but the significance will depend on the size of the files and the context your program runs in.
If you actually know the max number of strings and max width, this will be a lot faster (but you may waste a lot of memory if you use less than the "max"). The happy medium is to do what a lot of dynamic array implementations in C++ do: whenever you have to realloc myArray, alloc twice as much space as you need, and only realloc again once you've run out of space. This has O(log n) performance cost.
This may be a big performance hit. I strongly recommend using fseek, though the details will depend on your algorithm.

I often find the performance overhead to be outweighed by the direct memory management that comes with malloc and those low-level C handlers on memory. Unless these areas of memory are going to remain static and untouched for an amount of time that is in amortized time greater than touching this memory, it may be more beneficial to stick with the static array. In the end, it's up to you.

Related

what's more efficient: reading from a file or allocating memory

I have a text file and I should allocate an array with as many entries as the number of lines in the file. What's more efficient: to read the file twice (first to find out the number of lines) and allocate the array once, or to read the file once, and use "realloc" after each line read? thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input teice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
I presume you want to store the read lines also and not just allocate an array of that many entries.
Also that you don't want to change the lines and then write them back as in that case you might be better off using mmap.
Reading a file twice is always bad, even if it is cached the 2nd time, too many system calls are needed. Also allocing every line separately if a waste of time if you don't need to dealloc them in a random order.
Instead read the entire file at once, into an allocated area.
Find the number of lines by finding line feeds.
Alloc an array
Put the start pointers into the array by finding the same line feeds again.
If you need it as strings, then replace the line feed with \0
This might also be improved upon on modern cpu-architectures, instead of reading the array twice it might be faster simply allocating a "large enough" array for the pointer and scan the array once. This will cause a realloc at the end to have the right size and potentially a couple of times to make the array larger if it wasn't large enough at start.
Why is this faster? because you have a lot of if's that can take a lot of time for each line. So its better to only have to do this once, the cost is the reallocation, but copying large arrays with memcpy can be a bit cheaper.
But you have to measure it, your system settings, buffer sizes etc. will influence things too.
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.

How to buffer a line in a file by using System Calls in C?

Here is my approach:
int linesize=1
int ReadStatus;
char buff[200];
ReadStatus=read(file,buff,linesize)
while(buff[linesize-1]!='\n' && ReadStatus!=0)
{
linesize++;
ReadStatus=read(file,buf,linesize)
}
Is this idea right?
I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.
What do you think?
....
I just saw a new problem. How do we read the second line?. Is there anyway to delimit the bytes?
Is this idea right?
No. At the heart of a comment written by Siguza, lies the summary of an issue:
1) read doesn't read lines, it just reads bytes. There's no reason buff should end with \n.
Additionally, there's no reason buff shouldn't contain multiple newline characters, and as there's no [posix] tag here there's no reason to suggest what read does, let alone whether it's a syscall. Assuming you're referring to the POSIX function, there's no error handling. Where's your logic to handle the return value/s reserved for errors?
I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.
Providing you fix the issues mentioned above (more on that later), if you were to test this theory, you'd likely find, also at the heart of the comment by Siguza,
Disks usually work on a 512-byte basis and file system caches and even CPU/memory caches are a lot larger than that.
To an extent, you can expect your idea to approach O(log n), but your bottleneck will be one of those cache lines (likely the one closest to your keyboard/the filesystem/whatever is feeding the stream with information). At that point, you should stop guzzling memory which other programs might need because your optimisation becomes less and less effective.
What do you think?
I think you should just STOP! You're guessing!
Once you've written your program, decide whether or not it's too slow. If it's not too slow, it doesn't need optimisation, and you probably won't shave enough nanoseconds to make optimisation worthwhile.
If it is to slow, then you should:
Use a profiler to determine what the most significant bottleneck is,
apply optimisations based on what your profiler tells you, then
use your profiler again, with the same inputs as before, to measure the effect your optimisation had.
If you don't use a profiler, your guess-work could result in slower code, or you might miss opportunities for more significant optimisations...
How do we read the second line?
Naturally, it makes sense to read character by character, rather than two hundred characters at a time, because there's no other way to stop reading the moment you reach a line terminating character.
Is there anyway to delimit the bytes?
Yes. The most sensible tools to use are provided by the C standard, and syscalls are managed automatically to be most efficient based on configurations decided by the standard library devs (who are much likely better at this than you are). Those tools are:
fgets to attempt to read a line (by reading one character at a time), up to a threshold (the size of your buffer). You get to decide how large a line should be, because it's more often the case that you won't expect a user/program to input huge lines.
strchr or strcspn to detect newlines from within your buffer, in order to determine whether you read a complete line.
scanf("%*[^\n]"); to discard the remainder of an incomplete line, when you detect those.
realloc to reallocate your buffer, if you decide you want to resize it and call fgets a second time to retrieve more data rather than discarding the remainder. Note: this will have an effect on the runtime complexity of your code, not that I think you should care about that...
Other options are available for the first three. You could use fgetc (or even read one character at a time) like I did at the end of this answer, for example...
In fact, that answer is highly relevant to your question, as it does make an attempt to exponentially increase the size. I wrote another example of this here.
It should be pointed out that the reason to address these problems is not so much optimisation, but the need to read a large, yet variadic in size chunk of memory. Remember, if you haven't yet written the code, it's likely you won't know whether it's worthwhile optimising it!
Suffice to say, it isn't the read function you should try to reduce your dependence upon, but the malloc/realloc/calloc function... That's the real kicker! If you don't absolutely need to store the entire line, then don't!

Fortran: How do I allocate arrays when reading a file of unknown size?

My typical use of Fortran begins with reading in a file of unknown size (usually 5-100MB). My current approach to array allocation involves reading the file twice. First to determine the size of the problem (to allocate arrays) and a second time to read the data into those arrays.
Are there better approaches to size determination/array allocation? I just read about automatic array allocation (example below) in another post that seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?
I'll bite, though the question is teetering close to off-topicality. Your options are:
Read the file once to get the array size, allocate, read again.
Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
Always, always, work with files which contain metadata to tell an interested program how much data there is; for example a block
header line telling you how many data elements are in the next
block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project and so much wasted time and effort saved down the line. You don't have to jump on HDF5 or a similar heavyweight file design method, just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc) then HDF5 or something similar.
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.
I would add to the previous answer:
If your data has a regular structure and it's possible to open it in a txt file, press ctrl+end substract header to the rows total and there it is. Although you may waste time opening it if it's very large.

What is more efficient, reading word by word from file or reading a line at a time and splitting the string using C ?

I want to develop an application in C where I need to check word by word from a file on disk. I've been told that reading a line from file and then splitting it into words is more efficient as less file accesses are required. Is it true?
If you know you're going to need the entire file, you may as well be reading it in as large chunks as you can (at the extreme end, you'll memory map the entire file into memory in one go). You are right that this is because less file accesses are needed.
But if your program is not slow, then write it in the way that makes it the fastest and most bug free for you to develop. Early optimization is a grievous sin.
Not really true, assuming you're going to be using scanf() and your definition of 'word' matches what scanf() treats as a word.
The standard I/O library will buffer the actual disk reads, and reading a line or a word will have essentially the same I/O cost in terms of disk accesses. If you were to read big chunks of a file using fread(), you might get some benefit — at a cost in complexity.
But for reading words, it's likely that scanf() and a protective string format specifier such as %99s if your array is char word[100]; would work fine and is probably simpler to code.
If your definition of word is more complex than the definition supported by scanf(), then reading lines and splitting is probably easier.
As far as splitting is concerned there is no difference with respect to performance. You are splitting using whitespace in one case and newline in another.
However it would impact in case of word in a way that you would need to allocate buffers M times, while in case of lines it will be N times, where M>N. So if you are adopting word split approach, try to calculate total memory need first, allocate that much chunk (so you don't end up with fragmented M chunks), and later get M buffers from that chunk. Note that same approach can be applied in lines split but the difference will be less visible.
This is correct, you should read them in to a buffer, and then split into whatever you define as 'words'.
The only case where this would not be true is if you can get fscanf() to correctly grab out what you consider to be words (doubtful).
The major performance bottlenecks will likely be:
Any call to a stdio file I/O function. The less calls, the less overhead.
Dynamic memory allocation. Should be done as scarcely as possible. Ultimately, a lot of calls to malloc will cause heap segmentation.
So what it boils down to is a classic programming consideration: you can get either quick execution time or you can get low memory usage. You can't get both, but you can find some suitable middle-ground that is most effective both in terms of execution time and memory consumption.
To one extreme, the fastest possible execution can be obtained by reading the whole file as one big chunk and upload it to dynamic memory. Or to the other extreme, you can read it byte by byte and evaluate it as you read, which might make the program slower but will not use dynamic memory at all.
You will need a fundamental knowledge of various CPU-specific and OS-specific features to optimize the code most effectively. Issues like alignment, cache memory layout, the effectiveness of the underlying API function calls etc etc will all matter.
Why not try a few different ways and benchmark them?
Not actually answer to your exact question (words vs lines), but if you need all words in memory at the same time anyway, then the most efficient approach is this:
determine file size
allocate buffer for entire file plus one byte
read entire file to the buffer, and put '\0' to the extra byte.
make a pass over it and count how many words it has
allocate char* (pointers to words) or int (indexes to buffer) index array, with size matching word count
make 2nd pass over buffer, and store addresses or indexes to the first letters of words to the index array, and overwrite other bytes in buffer with '\0' (end of string marker).
If you have plenty of memory, then it's probably slightly faster to just assume the worst case for number of words: (filesize+1) / 2 (one letter words with one space in between, with odd number of bytes in file). Also taking the Java ArrayList or Qt QVector approach with the index array, and using realloc() to double it's size when word count exceeds current capacity, will be quite efficient (due to doubling=exponential growth, reallocation will not happen many times).

C - Dynamic Array

I'm trying to feed an array with fscanf() while looping through a file containing a list of integers, n integers long. It seems that I need to use malloc and/or potentially realloc. I've heard that the malloc command takes a noticeable amount of execution time and that it's best to over-allocate. Would someone mind helping me out with understanding the building blocks of accomplishing this end?
Disclaimer: I'm new to C.
No, what you've heard is misleading (at least to me). malloc is a just a function, and usually a fast one.
Most of the time it does all of its job in user-space. It "overallocates" so you don't have to
The bookkeeping (the linked list with free blocks etc.) is highly optimized since virtually everyone uses malloc
It's unrealistic to think you can easily beat malloc at this game. I am sorry if this doesn't answer your question (which was pretty general) but you have to realize there is no (spoon) optimization you can easily implement.
Reading the file will be much slower than allocating the memory!
You may want to read the whole file and find out how many entires you want and then malloc() all in one go.
malloc(sizeof(int)*n)
Premature optimization is the root of all evil (google it).
That said, allocate whatever amount you guess is reasonable/typical for the task at hand, and double it whenever you have to realloc. This strategy is rather hard to beat.
For your specific case, malloc isn't going to cause you issues. The run time of fscanf will be many, many times slower than the overhead of malloc and free. But, it can add up in high performance areas of an app. In these areas, there are other ways such as mem pools and fixed size allocators than can combat malloc()'s overhead. But, you are no where near needing to worry about performance overhead when you are just starting out.
Note that malloc() adds some overhead to each allocation to maintain its internal data structures (at least 4 bytes in common implementations), so if you integers are 4 bytes long, doing a malloc() for each integer would have >= 50% overhead (probably 75%). This would be the equivalent of using an array of Integer's in Java, instead of an array of int's.
As #Charles Dowd said, it's much better to allocate all the memory in one go, to avoid overhead.
You don't want to call malloc or realloc with every integer read, that's for sure. Can you estimate how much space you will need? Do you control the file format? If so, you could have the first line of the file be a single integer that denotes how many integers are to be read from the file. Then you could allocate all the space you need in one go. If you don't control the format and can't do this, follow the other suggestion mentioned in this thread: allocate a reasonably-sized buffer, and double it every time you run out of space.
It's a text file (not binary) and not in a fixed format, right? Otherwise it would be easy to calculate the size of the array from the file size ( buffer_size = file_size / record_size , buffersize is in words (the size of an int), the other sizes is in bytes).
This is what I would do (but I'm a bit of a nutter when it comes to applied statistics).
1) What is the maximum number of characters (a.k.a. bytes) a number (a.k.a. record) will occupy in the file, don't forget to include the end-of-line characters (CR, NF) and other blank glyphs (spaces, tabs et.c.)? If you already can estimate what the average size of a record would be, then it is even better, you use that instead of the maximum size.
initial_buffer_size = file_size / max_record_size + 1 (/ is integer division)
2) Allocate that buffer, read your integers into that buffer until it is full. If the whole file is read then you are finished, otherwise resize or reallocate buffer to meet your new estimated needs.
resize_size =
prev_buffer_size
+ bytes_not_read / ( bytes_already_read / number_of_records_already_read )
+ 1
3) Read into that buffer (from where the previous reading ended) until it is full, or all the of file has been read.
4) If not finished, repeat from step 2) with the new prev_buffer_size.
This will work best if the numbers (records) are totally randomly distributed from a byte-size point of view. If not, and if you have a clue what kind of distribution they have, you can adjust the algorithm according to that.

Resources