I'm trying to feed an array with fscanf() while looping through a file containing a list of integers, n integers long. It seems that I need to use malloc and/or potentially realloc. I've heard that the malloc command takes a noticeable amount of execution time and that it's best to over-allocate. Would someone mind helping me out with understanding the building blocks of accomplishing this end?
Disclaimer: I'm new to C.
No, what you've heard is misleading (at least to me). malloc is just a function, and usually a fast one.
Most of the time it does all of its work in user space. It "over-allocates" so you don't have to.
The bookkeeping (the linked list of free blocks, etc.) is highly optimized, since virtually everyone uses malloc.
It's unrealistic to think you can easily beat malloc at this game. I am sorry if this doesn't answer your question (which was pretty general) but you have to realize there is no (spoon) optimization you can easily implement.
Reading the file will be much slower than allocating the memory!
You may want to read the whole file first, find out how many entries you need, and then malloc() it all in one go:
malloc(sizeof(int)*n)
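A minimal sketch of that two-pass idea, assuming the file contains nothing but whitespace-separated integers ("numbers.txt" is just a placeholder name):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("numbers.txt", "r");   /* placeholder file name */
        if (fp == NULL)
            return 1;

        /* first pass: count the integers */
        size_t n = 0;
        int value;
        while (fscanf(fp, "%d", &value) == 1)
            n++;

        /* allocate everything in one go */
        int *numbers = malloc(n * sizeof *numbers);
        if (numbers == NULL) {
            fclose(fp);
            return 1;
        }

        /* second pass: rewind and store the values this time */
        rewind(fp);
        for (size_t i = 0; i < n && fscanf(fp, "%d", &numbers[i]) == 1; i++)
            ;

        fclose(fp);
        free(numbers);
        return 0;
    }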
Premature optimization is the root of all evil (google it).
That said, allocate whatever amount you guess is reasonable/typical for the task at hand, and double it whenever you have to realloc. This strategy is rather hard to beat.
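For instance, here is a rough sketch of that doubling strategy while scanning integers with fscanf(); the initial capacity of 16 is an arbitrary guess, and the input is read from stdin for simplicity:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t capacity = 16;                   /* arbitrary initial guess */
        size_t count = 0;
        int *numbers = malloc(capacity * sizeof *numbers);
        if (numbers == NULL)
            return 1;

        int value;
        while (fscanf(stdin, "%d", &value) == 1) {
            if (count == capacity) {            /* out of room: double it */
                size_t new_capacity = capacity * 2;
                int *tmp = realloc(numbers, new_capacity * sizeof *numbers);
                if (tmp == NULL) {              /* realloc can fail; old block is still valid */
                    free(numbers);
                    return 1;
                }
                numbers = tmp;
                capacity = new_capacity;
            }
            numbers[count++] = value;
        }

        printf("read %zu integers\n", count);
        free(numbers);
        return 0;
    }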
For your specific case, malloc isn't going to cause you issues. The run time of fscanf will be many, many times slower than the overhead of malloc and free. But, it can add up in high performance areas of an app. In these areas, there are other ways such as mem pools and fixed size allocators than can combat malloc()'s overhead. But, you are no where near needing to worry about performance overhead when you are just starting out.
Note that malloc() adds some overhead to each allocation to maintain its internal data structures (at least 4 bytes in common implementations), so if your integers are 4 bytes long, doing a malloc() for each integer would have >= 50% overhead (probably 75%). This would be the equivalent of using an array of Integers in Java instead of an array of ints.
As @Charles Dowd said, it's much better to allocate all the memory in one go, to avoid overhead.
You don't want to call malloc or realloc with every integer read, that's for sure. Can you estimate how much space you will need? Do you control the file format? If so, you could have the first line of the file be a single integer that denotes how many integers are to be read from the file. Then you could allocate all the space you need in one go. If you don't control the format and can't do this, follow the other suggestion mentioned in this thread: allocate a reasonably-sized buffer, and double it every time you run out of space.
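If you do control the format, a sketch of that header-count approach might look like this ("numbers.txt" is a placeholder, and the first value in the file is assumed to hold the count):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("numbers.txt", "r");   /* placeholder file name */
        if (fp == NULL)
            return 1;

        size_t n;
        if (fscanf(fp, "%zu", &n) != 1) {       /* first value: how many integers follow */
            fclose(fp);
            return 1;
        }

        int *numbers = malloc(n * sizeof *numbers);
        if (numbers == NULL) {
            fclose(fp);
            return 1;
        }

        size_t got = 0;
        while (got < n && fscanf(fp, "%d", &numbers[got]) == 1)
            got++;

        fclose(fp);
        /* ... use numbers[0..got-1] here ... */
        free(numbers);
        return 0;
    }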
It's a text file (not binary) and not in a fixed format, right? Otherwise it would be easy to calculate the size of the array from the file size (buffer_size = file_size / record_size, where buffer_size is in words (the size of an int) and the other sizes are in bytes).
This is what I would do (but I'm a bit of a nutter when it comes to applied statistics).
1) What is the maximum number of characters (a.k.a. bytes) a number (a.k.a. record) will occupy in the file? Don't forget to include the end-of-line characters (CR, LF) and other blank glyphs (spaces, tabs etc.). If you can already estimate the average size of a record, even better: use that instead of the maximum size.
initial_buffer_size = file_size / max_record_size + 1 (/ is integer division)
2) Allocate that buffer and read your integers into it until it is full. If the whole file has been read, you are finished; otherwise resize or reallocate the buffer to meet your new estimated needs:
resize_size =
prev_buffer_size
+ bytes_not_read / ( bytes_already_read / number_of_records_already_read )
+ 1
3) Read into that buffer (from where the previous reading ended) until it is full, or all of the file has been read.
4) If not finished, repeat from step 2) with the new prev_buffer_size.
This will work best if the numbers (records) are totally randomly distributed from a byte-size point of view. If not, and if you have a clue what kind of distribution they have, you can adjust the algorithm according to that.
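A rough sketch of this estimation scheme, under the assumptions that the file is a regular (seekable) file, that "numbers.txt" is a placeholder name, and that 12 characters is a plausible worst-case record size:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("numbers.txt", "r");           /* placeholder file name */
        if (fp == NULL)
            return 1;

        /* file size via fseek/ftell (fine for regular files) */
        fseek(fp, 0, SEEK_END);
        long file_size = ftell(fp);
        rewind(fp);

        long max_record_size = 12;                      /* assumed worst case: digits, sign, separator */
        size_t capacity = (size_t)(file_size / max_record_size) + 1;
        size_t count = 0;
        int *numbers = malloc(capacity * sizeof *numbers);
        if (numbers == NULL) {
            fclose(fp);
            return 1;
        }

        int value;
        while (fscanf(fp, "%d", &value) == 1) {
            if (count == capacity) {
                /* buffer full: re-estimate from the average record size seen so far */
                long bytes_read = ftell(fp);
                long bytes_left = file_size - bytes_read;
                long avg_record = bytes_read / (long)count;
                capacity += (size_t)(bytes_left / (avg_record > 0 ? avg_record : 1)) + 1;
                int *tmp = realloc(numbers, capacity * sizeof *numbers);
                if (tmp == NULL) {
                    free(numbers);
                    fclose(fp);
                    return 1;
                }
                numbers = tmp;
            }
            numbers[count++] = value;
        }

        fclose(fp);
        free(numbers);
        return 0;
    }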
Related
For example, the program grabs the line: Hello World! and assigns the string to a dynamic array.
The length of each line is unknown and I want compatibility for all sizes.
getline() is the obvious answer here, as Barmar suggested, but fgets() is also an option (see https://en.wikibooks.org/wiki/C_Programming/stdio.h/fgets).
But from what I understand, you don't know its size, yet you want to put it into a perfectly sized dynamic array right off the bat? That takes some crafty thinking and is awkward in a compiled language. The only way I can think of off the top of my head is quite slow: open the file twice, once to get each line's length, and a second time to read each line into a buffer of exactly that size (malloc'd per line), storing pointers to these dynamic arrays in a list. This takes a lot longer to execute, so if you're not limited on CPU power, it may be an option.
Normally, you'd just know what maximum size to expect and have the array defined at that maximum size. In the grand scheme of things, an extra 50 bytes isn't gonna hurt anything... which hurts me as an embedded guy to say that, but computers have large enough memory these days...
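A minimal getline() sketch, with the caveat that getline() is POSIX rather than standard C (on non-POSIX platforms you would fall back to fgets() plus realloc()):

    #define _GNU_SOURCE        /* getline() is POSIX, not standard C */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>     /* ssize_t */

    int main(void)
    {
        char *line = NULL;     /* getline() allocates and grows this for us */
        size_t cap = 0;
        ssize_t len;

        while ((len = getline(&line, &cap, stdin)) != -1) {
            /* line holds one full line, including the trailing '\n' if there was one */
            printf("read %zd bytes: %s", len, line);
        }

        free(line);            /* one free, no matter how many lines were read */
        return 0;
    }

getline() grows the buffer with realloc() behind the scenes, so each line lands in a buffer that is at least big enough, and a single free() at the end is sufficient.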
I have a text file and I should allocate an array with as many entries as the number of lines in the file. What's more efficient: to read the file twice (first to find out the number of lines) and allocate the array once, or to read the file once, and use "realloc" after each line read? thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input twice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
I presume you want to store the read lines also and not just allocate an array of that many entries.
Also that you don't want to change the lines and then write them back as in that case you might be better off using mmap.
Reading a file twice is always bad: even if it is cached the second time, too many system calls are needed. Also, allocating every line separately is a waste of time if you don't need to deallocate them in a random order.
Instead read the entire file at once, into an allocated area.
Find the number of lines by finding line feeds.
Alloc an array
Put the start pointers into the array by finding the same line feeds again.
If you need the lines as strings, replace each line feed with \0 (see the sketch below).
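A minimal sketch of those steps, assuming a regular (seekable) file; "input.txt" is a placeholder name, and a final line without a trailing line feed is ignored for brevity:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("input.txt", "r");     /* placeholder file name */
        if (fp == NULL)
            return 1;

        /* read the entire file into one allocated block */
        fseek(fp, 0, SEEK_END);
        long size = ftell(fp);
        rewind(fp);
        char *buf = malloc((size_t)size + 1);
        if (buf == NULL) {
            fclose(fp);
            return 1;
        }
        size_t got = fread(buf, 1, (size_t)size, fp);
        buf[got] = '\0';
        fclose(fp);

        /* first scan: count the line feeds */
        size_t lines = 0;
        for (size_t i = 0; i < got; i++)
            if (buf[i] == '\n')
                lines++;

        /* allocate the pointer array, then scan again to fill it in */
        char **start = malloc(lines * sizeof *start);
        if (start == NULL) {
            free(buf);
            return 1;
        }
        size_t n = 0;
        char *p = buf;
        for (size_t i = 0; i < got; i++) {
            if (buf[i] == '\n') {
                buf[i] = '\0';                  /* turn the line into a C string */
                start[n++] = p;
                p = &buf[i + 1];
            }
        }

        printf("%zu lines\n", n);
        free(start);
        free(buf);
        return 0;
    }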
This might also be improved upon on modern CPU architectures: instead of scanning the buffer twice, it might be faster to simply allocate a "large enough" array for the pointers and scan the buffer once. That costs a realloc at the end to trim the array to the right size, and potentially a couple more along the way if it wasn't large enough to start with.
Why is this faster? Because you have a lot of ifs that can take a lot of time for each line, so it's better to only do that work once. The cost is the reallocation, but copying large arrays with memcpy can be comparatively cheap.
But you have to measure it, your system settings, buffer sizes etc. will influence things too.
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.
I want to ask if it is possible to read the same input (stdin) multiple times. I am going to receive a really big number, containing thousands of digits (so I am unable to store it in a variable, and I also cannot use folders!). My idea is to put the digits into an int array, but I don't know how big the array should be, because the number of digits in the input may vary. I have to write a general solution.
So my question is how to solve this, and how to find out the number of digits (so I can initialize the array) before I copy the digits into it. I tried using scanf() multiple times, or scanf() and getchar(), but it is not working. See my code:
    int main(){
        int c;
        int amountOfDigits=5;
        while(scanf("%1d",&c)!=' '){ //finding out number of digits with scanf
            if(isdigit(c)==0){
                break;
            }
            amountOfDigits++;
        }
        int digits[amountOfDigits]; //now i know length of array, and initialize it
        for(int i=0;i<amountOfDigits;i++){ //putting digits into array
            digits[i]=getchar();
        }
        for(int i=0;i<amountOfDigits;i++){ //printing array
            printf("%d",digits[i]);
        }
        printf("\n");
        return 0;
    }
is it possible to read the same input (stdin) multiple times?
(I am guessing you are a student beginning to learn programming, and you are using Linux; adapt my answer if not)
For your homework, you don't need to read the same input several times. In some cases it is possible (when the standard input is a genuine file, i.e. seekable, which is what you get when you use some redirection in your command). In other cases (e.g. when the standard input is a pipe, as in a command pipeline, or a here document in your shell command) it is not possible to read stdin several times (but you don't need to). In general, don't expect stdin to be seekable with fseek or rewind (it usually is not).
(I am not going to do your homework, but here are useful hints)
so I am unable to store it in a variable, and I also cannot use folders!
You could do several things:
(since you mentioned folders...) you might use some more sophisticated ways of storing data on the disk (but in your particular case, I don't recommend that). These ways could be some direct-access file (ugly), or some indexed file à la gdbm, or some database à la sqlite, or even some RDBMS server like PostgreSQL.
In your case, you don't need any of these; I'm mentioning it since you mentioned "folders" and you meant "directories"!
you really should use some heap-allocated memory, so read about C dynamic memory allocation and read carefully the documentation of each of the standard memory management functions: malloc, realloc, free. Your program should probably use all three (don't forget that malloc and realloc can fail).
You probably should keep somehow:
a pointer to heap allocated int-s (actually, you could use char-s)
the allocated size of that buffer
the used length, that is, the actual number of useful digits
You certainly don't want to grow your array with a realloc on every iteration of the loop (that is inefficient). In practice, you would adopt some growth scheme like newsize = 3*oldsize/2 + 10 to avoid reallocating memory at each step of your input loop.
you should thank your teacher for such a useful exercise, but you should not expect StackOverflow to do your homework!
Be also aware of arbitrary-precision arithmetic (called bignums or bigints). It is actually hard to code efficiently, so in real-life you would use some library like GMPlib.
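A minimal sketch along the lines of the hints above (reading digits from stdin one at a time and growing the buffer with the suggested scheme); the growth constants are the ones mentioned, everything else is an arbitrary choice:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t size = 0;       /* allocated capacity */
        size_t used = 0;       /* digits actually stored */
        char *digits = NULL;

        int c;
        while ((c = getchar()) != EOF && isdigit(c)) {
            if (used == size) {
                size_t newsize = 3 * size / 2 + 10;     /* growth scheme mentioned above */
                char *tmp = realloc(digits, newsize);
                if (tmp == NULL) {                      /* realloc can fail */
                    free(digits);
                    return 1;
                }
                digits = tmp;
                size = newsize;
            }
            digits[used++] = (char)c;
        }

        printf("read %zu digits: ", used);
        for (size_t i = 0; i < used; i++)
            putchar(digits[i]);
        putchar('\n');

        free(digits);
        return 0;
    }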
I want to develop an application in C where I need to check word by word from a file on disk. I've been told that reading a line from file and then splitting it into words is more efficient as less file accesses are required. Is it true?
If you know you're going to need the entire file, you may as well be reading it in as large chunks as you can (at the extreme end, you'll memory map the entire file into memory in one go). You are right that this is because less file accesses are needed.
But if your program is not slow, then write it in the way that makes it the fastest and most bug free for you to develop. Early optimization is a grievous sin.
Not really true, assuming you're going to be using scanf() and your definition of 'word' matches what scanf() treats as a word.
The standard I/O library will buffer the actual disk reads, and reading a line or a word will have essentially the same I/O cost in terms of disk accesses. If you were to read big chunks of a file using fread(), you might get some benefit — at a cost in complexity.
But for reading words, it's likely that scanf() and a protective string format specifier such as %99s if your array is char word[100]; would work fine and is probably simpler to code.
If your definition of word is more complex than the definition supported by scanf(), then reading lines and splitting is probably easier.
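For the simple case, a word-reading loop can be as short as this; it reads from stdin for brevity, but fscanf(fp, ...) with a FILE* works the same way:

    #include <stdio.h>

    int main(void)
    {
        char word[100];

        /* %99s leaves room for the terminating '\0' and skips leading whitespace,
           so "word" here means "whitespace-delimited token" */
        while (scanf("%99s", word) == 1)
            printf("got: %s\n", word);

        return 0;
    }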
As far as splitting is concerned there is no difference with respect to performance. You are splitting using whitespace in one case and newline in another.
However, it does affect allocation: with the word-by-word approach you would need to allocate buffers M times, while with lines it will be N times, where M > N. So if you adopt the word-split approach, try to calculate the total memory needed first and allocate one chunk of that size (so you don't end up with M fragmented chunks), then carve the M buffers out of that chunk later. Note that the same approach can be applied to line splitting, but the difference will be less visible.
This is correct, you should read them in to a buffer, and then split into whatever you define as 'words'.
The only case where this would not be true is if you can get fscanf() to correctly grab out what you consider to be words (doubtful).
The major performance bottlenecks will likely be:
Any call to a stdio file I/O function. The less calls, the less overhead.
Dynamic memory allocation. Should be done as sparingly as possible. Ultimately, a lot of calls to malloc will cause heap fragmentation.
So what it boils down to is a classic programming consideration: you can get either quick execution time or you can get low memory usage. You can't get both, but you can find some suitable middle-ground that is most effective both in terms of execution time and memory consumption.
At one extreme, the fastest possible execution can be obtained by reading the whole file as one big chunk and loading it into dynamic memory. At the other extreme, you can read it byte by byte and evaluate it as you read, which might make the program slower but will not use dynamic memory at all.
You will need a fundamental knowledge of various CPU-specific and OS-specific features to optimize the code most effectively. Issues like alignment, cache memory layout, the effectiveness of the underlying API function calls etc etc will all matter.
Why not try a few different ways and benchmark them?
Not actually an answer to your exact question (words vs lines), but if you need all words in memory at the same time anyway, then the most efficient approach is this:
determine file size
allocate buffer for entire file plus one byte
read entire file to the buffer, and put '\0' to the extra byte.
make a pass over it and count how many words it has
allocate char* (pointers to words) or int (indexes to buffer) index array, with size matching word count
make 2nd pass over buffer, and store addresses or indexes to the first letters of words to the index array, and overwrite other bytes in buffer with '\0' (end of string marker).
If you have plenty of memory, then it's probably slightly faster to just assume the worst case for the number of words: (filesize+1) / 2 (one-letter words with one space in between, with an odd number of bytes in the file). Also, taking the Java ArrayList or Qt QVector approach with the index array, and using realloc() to double its size when the word count exceeds the current capacity, will be quite efficient (due to doubling = exponential growth, reallocation will not happen many times).
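Here is a sketch of the two-pass version described above, assuming a regular (seekable) file and treating any whitespace as a word separator; "words.txt" is a placeholder name:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("words.txt", "rb");    /* placeholder file name */
        if (fp == NULL)
            return 1;

        fseek(fp, 0, SEEK_END);
        long size = ftell(fp);
        rewind(fp);

        char *buf = malloc((size_t)size + 1);   /* whole file plus one byte */
        if (buf == NULL) {
            fclose(fp);
            return 1;
        }
        size_t got = fread(buf, 1, (size_t)size, fp);
        buf[got] = '\0';
        fclose(fp);

        /* pass 1: count words (transitions from whitespace to non-whitespace) */
        size_t words = 0;
        int in_word = 0;
        for (size_t i = 0; i < got; i++) {
            if (isspace((unsigned char)buf[i]))
                in_word = 0;
            else if (!in_word) {
                words++;
                in_word = 1;
            }
        }

        /* pass 2: remember where each word starts, and terminate it with '\0' */
        char **word_ptr = malloc(words * sizeof *word_ptr);
        if (word_ptr == NULL) {
            free(buf);
            return 1;
        }
        size_t n = 0;
        in_word = 0;
        for (size_t i = 0; i < got; i++) {
            if (isspace((unsigned char)buf[i])) {
                buf[i] = '\0';
                in_word = 0;
            } else if (!in_word) {
                word_ptr[n++] = &buf[i];
                in_word = 1;
            }
        }

        printf("%zu words\n", n);
        free(word_ptr);
        free(buf);
        return 0;
    }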
As I loop through lines in file A, I am parsing the line and putting each string (char*) into a char**.
At the end of a line, I then run a procedure that consists of opening file B, using fgets, fseek and fgetc to grab characters from that file. I then close file B.
I repeat reopening and reclosing file B for each line.
What I would like to know is:
Is there a significant performance hit from using malloc and free, such that I should use something static like myArray[NUM_STRINGS][MAX_STRING_WIDTH] instead of a dynamic char** myArray?
Is there significant performance overhead from opening and closing file B (conceptually, many thousands of times)? If my file A is sorted, is there a way for me to use fseek to move "backwards" in file B, to reset where I was previously located in file B?
EDIT Turns out that a two-fold approach greatly reduced the runtime:
My file B is actually one of twenty-four files. Instead of opening up the same file B1 a thousand times, and then B2 a thousand times, etc. I open up file B1 once, close it, B2 once, close it, etc. This reduces many thousands of fopen and fclose operations to roughly 24.
I used rewind() to reset the file pointer.
This yielded a roughly 60-fold speed improvement, which is more than sufficient. Thanks for pointing me to rewind().
If your dynamic array grows over time, there is a copy cost on some reallocs. If you use the "always double" heuristic, the total copying cost amortizes to O(n), so it is not horrible. If you know the size ahead of time, a stack-allocated array will still be faster.
For the second question read about rewind. It has got to be faster than opening and closing all the time, and lets you do less resource management.
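A minimal sketch of keeping file B open and rewinding it for every line of file A, instead of reopening it each time; "A.txt" and "B.txt" are placeholder names and the processing itself is elided:

    #include <stdio.h>

    int main(void)
    {
        FILE *fileA = fopen("A.txt", "r");      /* placeholder file names */
        FILE *fileB = fopen("B.txt", "r");
        if (fileA == NULL || fileB == NULL)
            return 1;

        char lineA[256], lineB[256];
        while (fgets(lineA, sizeof lineA, fileA) != NULL) {
            rewind(fileB);                      /* back to the start of B, no fclose/fopen */
            while (fgets(lineB, sizeof lineB, fileB) != NULL) {
                /* ... process lineA against lineB here ... */
            }
        }

        fclose(fileA);
        fclose(fileB);
        return 0;
    }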
What I would like to know is:
does your code work correctly?
is it running fast enough for your purpose?
If the answer both of these is "yes", don't change anything.
Opening and closing has a variable overhead depending on whether other programs are competing for that resource.
Measure the file size first and then use that to calculate the array size in advance, so you can do one big heap allocation.
You won't get a multi-dimensional array right off, but a bit of pointer arithmetic and you are there.
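For example, a minimal sketch of that pointer arithmetic over one flat allocation (MAX_STRING_WIDTH and the string count are illustrative assumptions, not values from the question):

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_STRING_WIDTH 128        /* assumed fixed width per string */

    int main(void)
    {
        size_t num_strings = 1000;      /* e.g. derived from the measured file size */
        char *block = malloc(num_strings * MAX_STRING_WIDTH);
        if (block == NULL)
            return 1;

        /* "row i" starts at block + i * MAX_STRING_WIDTH */
        char *row42 = block + 42 * MAX_STRING_WIDTH;
        snprintf(row42, MAX_STRING_WIDTH, "hello");
        printf("%s\n", row42);

        free(block);
        return 0;
    }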
Can you not cache positional information in the other file and then, rather than opening and closing it, use previous seek indexes as an offset? Depends on the exact logic really.
If your files are large, disk I/O will be far more expensive than memory management. Worrying about malloc/free performance before profiling indicates that it is a bottleneck is premature optimization.
It is possible that the overhead from frequent open/close is significant in your program, but again, the actual I/O is likely to be more expensive, unless the files are small, in which case the loss of buffers between close and open can potentially cause extra disk I/O. And yes, you can use ftell() to get the current position in a file and then fseek() with SEEK_SET to return to it later.
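A small sketch of that save-and-restore pattern ("B.txt" is a placeholder name):

    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("B.txt", "r");             /* placeholder file name */
        if (fp == NULL)
            return 1;

        char buf[256];
        if (fgets(buf, sizeof buf, fp) == NULL) {   /* read the first line */
            fclose(fp);
            return 1;
        }

        long pos = ftell(fp);                       /* remember where we are */

        while (fgets(buf, sizeof buf, fp) != NULL)
            ;                                       /* ... read ahead as far as needed ... */

        fseek(fp, pos, SEEK_SET);                   /* jump straight back to the saved spot */

        fclose(fp);
        return 0;
    }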
There is always a performance hit with using dynamic memory. Using a static buffer will provide a speed boost.
There is also going to be a performance hit from reopening a file. You can use fseek(fp, pos, SEEK_SET) to set the file pointer to any position in the file, or fseek(fp, offset, SEEK_CUR) to do a relative move.
Significant performance hit is relative, and you will have to determine what that means for yourself.
I think it's better to allocate the actual space you need, and the overhead will probably not be significant. This avoids both wasting space and stack overflows.
Yes. Though the IO is cached, you're making unnecessary syscalls (open and close). Use fseek, probably with SEEK_CUR or SEEK_SET.
In both cases, there is some performance hit, but the significance will depend on the size of the files and the context your program runs in.
If you actually know the max number of strings and max width, this will be a lot faster (but you may waste a lot of memory if you use less than the "max"). The happy medium is to do what a lot of dynamic array implementations in C++ do: whenever you have to realloc myArray, allocate twice as much space as you need, and only realloc again once you've run out of space. This needs only O(log n) reallocations, and the total copying cost stays linear.
This may be a big performance hit. I strongly recommend using fseek, though the details will depend on your algorithm.
I often find the performance overhead of malloc to be outweighed by the direct control over memory that malloc and the other low-level C allocation functions give you. Unless those regions of memory are going to remain static and largely untouched for most of their lifetime, it may be more beneficial to stick with the static array. In the end, it's up to you.