I want to ask you if is it possible to read the same input (stdin) multiple times? I am about to get really big number, containing thousands of digits (so I am unable to store it in variable, (and also I can not use folders!). My idea is to put the digits into int array, but I don't know how big the array should be, because amount of digits in input may vary. I have to write general solution.
So my question is, how to solve it, and how to find out amount of digits (so I can initialize array), before I copy digits into array. I tried using scanf(), multiple times, or scanf() and getchar, but it is not working. See my code:
int main(){
int c;
int amountOfDigits=5;
while(scanf("%1d",&c)!=' '){//finding out number of digits with scanf
if(isdigit(c)==0){
break;
}
amountOfDigits++;
}
int digits[amountOfDigits];//now i know lenght of array, and initialize it
for(int i=0;i<amountOfDigits;i++){//putting digits into array
digits[i]=getchar();
}
for(int i=0;i<amountOfDigits;i++){//printing array
printf("%d",digits[i]);
}
printf("\n");
return 0;
}
is it possible to read the same input (stdin) multiple times?
(I am guessing you are a student beginning to learn programming, and you are using Linux; adapt my answer if not)
For your homework, you don't need to read the same input several times. In some cases, it is possible (when the standard input is a genuine file -seekable-, that is when you use some redirection in your command). In other cases (e.g. when the standard input is a pipe, e.g. with a command pipeline; or with here documents in your shell command...) it is not possible to read several times stdin (but you don't need to). In general, don't expect stdin to be seekable with fseek or rewind (it usually is not).
(I am not going to do your homework, but here are useful hints)
so I am unable to store it in variable, (and also I can not use folders!)
You could do several things:
(since you mentioned folders....) you might use some more sophisticated ways of storing data on the disk (but in your particular case, I don't recommend that ...). These ways could be some direct-accessed file (ugly), or some indexed file à la gdbm, or some database à la sqlite or even some RDBMS server like PostGreSQL.
In your case, you don't need any of these; I'm mentioning it since you mentioned "folders" and you meant "directories"!
you really should use some heap allocated memory, so read about C dynamic memory allocation and read carefully the documentation of each standard memory management functions like malloc, realloc, free. Your program should probably use all these three functions (don't forget that malloc & realloc could fail).
Read this and that answers. Both are surprisingly relevant.
You probably should keep somehow:
a pointer to heap allocated int-s (actually, you could use char-s)
the allocated size of that pointer
the used length of that thing, that is the actual number of useful digits.
You certainly don't want to grow your array by repeated realloc at each loop (that is inefficient). In practice, you would adapt some growing scheme like newsize = 3*oldsize/2 + 10 to avoid reallocating memory at each step (of your input loop).
you should thank your teacher for a so useful exercise, but you should not expect StackOverflow to do your homework!
Be also aware of arbitrary-precision arithmetic (called bignums or bigints). It is actually hard to code efficiently, so in real-life you would use some library like GMPlib.
Related
TL;DR:
I am asking you to tell me what would be the most efficient approach to double my strings and print them out?
Full story:
I had trouble with the title, and the actual problem may be a bit different than you expect.
Imagine I have a main buffer.
At some index determined by the program, I want to insert a string.
But every char in that string needs to be doubled.
So "abc", inserted at index 10 of buffer[999], needs to be "aabbcc".
Now, the second part of the problem - this needs to be as efficient as possible. I could make this easily, but I need the fastest option.
I thought I had devised several approaches, but it boils down to:
fill buffer(1000) with single chars and double the chars when printing (pushing to stdout)
fill buffer(2000) with double chars and print like normal
The variations to the second approach would be When to double the chars (when copying or generating "aabbcc" from the start and copy the full thing).
The first approach would be the most intuitive, but I fear I would need to devise a low level char-doubling function because putc and printf and any large amount of function calls will have much overhead. (There are allegedly very efficient functions in libc with bitshifting and pointer magic, but I couldn't find them. I can only find the very dissappointing versions where fgets() is just a wrapper for getc() - which can't be efficient.)
The second approach obviously wastes a lot of memory and requires a lot of copying, but it could probably put everything into stdout more efficiently as a chunk without the overhead of copying single chars.
I am unsure if under everything there is just a system write call, and I also lack the knowledge how it works. I am just going with my research that says that fgets is about 12 times faster than fgetc for equal data. And so I assume it is with all the single-char vs line functions.
So in conclusion, I am asking you to tell me what would be the most efficient approach to double my strings and print them out?
I have a text file and I should allocate an array with as many entries as the number of lines in the file. What's more efficient: to read the file twice (first to find out the number of lines) and allocate the array once, or to read the file once, and use "realloc" after each line read? thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input teice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
I presume you want to store the read lines also and not just allocate an array of that many entries.
Also that you don't want to change the lines and then write them back as in that case you might be better off using mmap.
Reading a file twice is always bad, even if it is cached the 2nd time, too many system calls are needed. Also allocing every line separately if a waste of time if you don't need to dealloc them in a random order.
Instead read the entire file at once, into an allocated area.
Find the number of lines by finding line feeds.
Alloc an array
Put the start pointers into the array by finding the same line feeds again.
If you need it as strings, then replace the line feed with \0
This might also be improved upon on modern cpu-architectures, instead of reading the array twice it might be faster simply allocating a "large enough" array for the pointer and scan the array once. This will cause a realloc at the end to have the right size and potentially a couple of times to make the array larger if it wasn't large enough at start.
Why is this faster? because you have a lot of if's that can take a lot of time for each line. So its better to only have to do this once, the cost is the reallocation, but copying large arrays with memcpy can be a bit cheaper.
But you have to measure it, your system settings, buffer sizes etc. will influence things too.
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.
I want to develop an application in C where I need to check word by word from a file on disk. I've been told that reading a line from file and then splitting it into words is more efficient as less file accesses are required. Is it true?
If you know you're going to need the entire file, you may as well be reading it in as large chunks as you can (at the extreme end, you'll memory map the entire file into memory in one go). You are right that this is because less file accesses are needed.
But if your program is not slow, then write it in the way that makes it the fastest and most bug free for you to develop. Early optimization is a grievous sin.
Not really true, assuming you're going to be using scanf() and your definition of 'word' matches what scanf() treats as a word.
The standard I/O library will buffer the actual disk reads, and reading a line or a word will have essentially the same I/O cost in terms of disk accesses. If you were to read big chunks of a file using fread(), you might get some benefit — at a cost in complexity.
But for reading words, it's likely that scanf() and a protective string format specifier such as %99s if your array is char word[100]; would work fine and is probably simpler to code.
If your definition of word is more complex than the definition supported by scanf(), then reading lines and splitting is probably easier.
As far as splitting is concerned there is no difference with respect to performance. You are splitting using whitespace in one case and newline in another.
However it would impact in case of word in a way that you would need to allocate buffers M times, while in case of lines it will be N times, where M>N. So if you are adopting word split approach, try to calculate total memory need first, allocate that much chunk (so you don't end up with fragmented M chunks), and later get M buffers from that chunk. Note that same approach can be applied in lines split but the difference will be less visible.
This is correct, you should read them in to a buffer, and then split into whatever you define as 'words'.
The only case where this would not be true is if you can get fscanf() to correctly grab out what you consider to be words (doubtful).
The major performance bottlenecks will likely be:
Any call to a stdio file I/O function. The less calls, the less overhead.
Dynamic memory allocation. Should be done as scarcely as possible. Ultimately, a lot of calls to malloc will cause heap segmentation.
So what it boils down to is a classic programming consideration: you can get either quick execution time or you can get low memory usage. You can't get both, but you can find some suitable middle-ground that is most effective both in terms of execution time and memory consumption.
To one extreme, the fastest possible execution can be obtained by reading the whole file as one big chunk and upload it to dynamic memory. Or to the other extreme, you can read it byte by byte and evaluate it as you read, which might make the program slower but will not use dynamic memory at all.
You will need a fundamental knowledge of various CPU-specific and OS-specific features to optimize the code most effectively. Issues like alignment, cache memory layout, the effectiveness of the underlying API function calls etc etc will all matter.
Why not try a few different ways and benchmark them?
Not actually answer to your exact question (words vs lines), but if you need all words in memory at the same time anyway, then the most efficient approach is this:
determine file size
allocate buffer for entire file plus one byte
read entire file to the buffer, and put '\0' to the extra byte.
make a pass over it and count how many words it has
allocate char* (pointers to words) or int (indexes to buffer) index array, with size matching word count
make 2nd pass over buffer, and store addresses or indexes to the first letters of words to the index array, and overwrite other bytes in buffer with '\0' (end of string marker).
If you have plenty of memory, then it's probably slightly faster to just assume the worst case for number of words: (filesize+1) / 2 (one letter words with one space in between, with odd number of bytes in file). Also taking the Java ArrayList or Qt QVector approach with the index array, and using realloc() to double it's size when word count exceeds current capacity, will be quite efficient (due to doubling=exponential growth, reallocation will not happen many times).
I need to write a program where during run time, a set of integers of arbitrary size will taken as input. They will be seperated by white space. At the end, a new line is given, showing the end of input. How do I save them into an array of integers so that i can display them later. I think it is a little difficult because the number of values that will be entered is not known during compilation
Sounds like homework.
Correct me if I am wrong and I will give you more than hints.
You can either declare an array of a really large size that would not possibly be filled by the user input, then use scanf or something like that to grab the integers until you hit '\n', or you can grab each integer at a time, allocating memory as you go, using a combination of malloc and memcpy calls. The first option should never be done in a real world problem, and I am certainly not advocating such practices even though your textbook probably tells you to do it this way.
There is an example just like this in K&R.
This is a typical problem you will have in C. The solution is usually one of two options.
Use a really large array that is large enough to hold the input. Sometimes this is a poor option when the data could be really large. An example of when it would be a bad idea is when you are saving a video frame or a large text file to the array. This also opens you up to a buffer overrun attack in older versions of Windows. However, this is sometimes a good quick hack solution for smaller (homework) programs where you can count on the user (i.e. your professor who is not trying to break your program) to not input 1000's of characters. Usually this is considered bad practice, please consider my 2nd option for the security reason I mentioned before.
Use dynamic arrays (i.e. malloc). This is probably what your professor wants you to do as this sounds like a typical problem to use when a student is first learning pointers and arrays. This is a great approach, just remember to call free on your memory when you are finished. The tricky part here is that you still have to know the size of the array you want ahead of time (not at compile time though of course).
I'm trying to feed an array with fscanf() while looping through a file containing a list of integers, n integers long. It seems that I need to use malloc and/or potentially realloc. I've heard that the malloc command takes a noticeable amount of execution time and that it's best to over-allocate. Would someone mind helping me out with understanding the building blocks of accomplishing this end?
Disclaimer: I'm new to C.
No, what you've heard is misleading (at least to me). malloc is a just a function, and usually a fast one.
Most of the time it does all of its job in user-space. It "overallocates" so you don't have to
The bookkeeping (the linked list with free blocks etc.) is highly optimized since virtually everyone uses malloc
It's unrealistic to think you can easily beat malloc at this game. I am sorry if this doesn't answer your question (which was pretty general) but you have to realize there is no (spoon) optimization you can easily implement.
Reading the file will be much slower than allocating the memory!
You may want to read the whole file and find out how many entires you want and then malloc() all in one go.
malloc(sizeof(int)*n)
Premature optimization is the root of all evil (google it).
That said, allocate whatever amount you guess is reasonable/typical for the task at hand, and double it whenever you have to realloc. This strategy is rather hard to beat.
For your specific case, malloc isn't going to cause you issues. The run time of fscanf will be many, many times slower than the overhead of malloc and free. But, it can add up in high performance areas of an app. In these areas, there are other ways such as mem pools and fixed size allocators than can combat malloc()'s overhead. But, you are no where near needing to worry about performance overhead when you are just starting out.
Note that malloc() adds some overhead to each allocation to maintain its internal data structures (at least 4 bytes in common implementations), so if you integers are 4 bytes long, doing a malloc() for each integer would have >= 50% overhead (probably 75%). This would be the equivalent of using an array of Integer's in Java, instead of an array of int's.
As #Charles Dowd said, it's much better to allocate all the memory in one go, to avoid overhead.
You don't want to call malloc or realloc with every integer read, that's for sure. Can you estimate how much space you will need? Do you control the file format? If so, you could have the first line of the file be a single integer that denotes how many integers are to be read from the file. Then you could allocate all the space you need in one go. If you don't control the format and can't do this, follow the other suggestion mentioned in this thread: allocate a reasonably-sized buffer, and double it every time you run out of space.
It's a text file (not binary) and not in a fixed format, right? Otherwise it would be easy to calculate the size of the array from the file size ( buffer_size = file_size / record_size , buffersize is in words (the size of an int), the other sizes is in bytes).
This is what I would do (but I'm a bit of a nutter when it comes to applied statistics).
1) What is the maximum number of characters (a.k.a. bytes) a number (a.k.a. record) will occupy in the file, don't forget to include the end-of-line characters (CR, NF) and other blank glyphs (spaces, tabs et.c.)? If you already can estimate what the average size of a record would be, then it is even better, you use that instead of the maximum size.
initial_buffer_size = file_size / max_record_size + 1 (/ is integer division)
2) Allocate that buffer, read your integers into that buffer until it is full. If the whole file is read then you are finished, otherwise resize or reallocate buffer to meet your new estimated needs.
resize_size =
prev_buffer_size
+ bytes_not_read / ( bytes_already_read / number_of_records_already_read )
+ 1
3) Read into that buffer (from where the previous reading ended) until it is full, or all the of file has been read.
4) If not finished, repeat from step 2) with the new prev_buffer_size.
This will work best if the numbers (records) are totally randomly distributed from a byte-size point of view. If not, and if you have a clue what kind of distribution they have, you can adjust the algorithm according to that.