For didactic purposes, I am working on a program that reads a string (array of chars) from standard input. The goal is to allow the program to sequentially increase the memory allocated according to the dimension of the input. I would like your opinion on my approach.
I thought I could allocate one byte of space one by one, for every reading cycle needed. Clearly, it does not work. How could I approach this problem? Is it even worth trying?
Thank you for your patience and support!
#include<stdio.h>
#include<stdlib.h>
#include<ctype.h>
int main(){
char *q;
int flag = 1, j = 0;
printf("\n\nNow let's go for a word of undefined lenght. Type it:\n\n");
do{
q++ = calloc(1, sizeof(char));
flag = ((q[j] = getchar()) != 0); #Until it is valid.
++j;
}while(flag);
return 0;
}
Reading the entire input at once is typically wrong.
The standard approach is to get a buffer of a few kilobytes and read and process the data in chunks of that size, overwriting previously read data that is no longer needed.
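A minimal sketch of that chunked approach, assuming the "processing" is simply counting new-lines. The function name count_lines and the 4096-byte chunk size are illustrative choices, not something from the question:

```c
#include <stdio.h>

/* Count the new-line characters in a stream, reading it in fixed-size
   chunks. Each fread() overwrites the previous chunk, so memory use
   stays constant no matter how large the input is. */
size_t count_lines(FILE *in)
{
    char chunk[4096];               /* assumed chunk size: "a few kilobytes" */
    size_t n, lines = 0;

    while ((n = fread(chunk, 1, sizeof chunk, in)) > 0) {
        for (size_t i = 0; i < n; i++) {
            if (chunk[i] == '\n')
                lines++;
        }
    }
    return lines;
}
```

Calling count_lines(stdin) would process standard input without ever holding more than one chunk in memory.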
In rare cases where you need to have the entire thing in ram and are reading from a regular file, you can fstat the file to get its size and allocate accordingly. If the file is big (megabytes in size), you should mmap.
Finally, in the extremely rare case where you need to read everything in and you can't know the size in advance, the way is to realloc, doubling the size each time, i.e. size *= 2; new = realloc(p, size); p = new; .... (check that new is not NULL before assigning it to p).
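The doubling strategy could look roughly like this. It is only a sketch: the name read_all and the starting size of 64 bytes are assumptions made for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read everything from `in` into one malloc'd, NUL-terminated string.
   Returns NULL on allocation failure; the caller frees the result. */
char *read_all(FILE *in)
{
    size_t size = 64, len = 0;      /* arbitrary small starting size */
    char *p = malloc(size);
    int c;                          /* int, so EOF is distinguishable */

    if (p == NULL)
        return NULL;
    while ((c = getc(in)) != EOF) {
        if (len + 1 == size) {      /* keep one byte free for the '\0' */
            size *= 2;
            char *tmp = realloc(p, size);
            if (tmp == NULL) {      /* p is still valid here: release it */
                free(p);
                return NULL;
            }
            p = tmp;
        }
        p[len++] = (char)c;
    }
    p[len] = '\0';
    return p;
}
```

Because the capacity doubles, reading n bytes costs only about log2(n) calls to realloc instead of n.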
Huge thanks to everyone who answered. I have realised that I suck a lot at this; I will take every answer into consideration and hopefully I will manage to compile something that works.
Some remarks:
Allocating 500 MB just in case doesn't seem like a good idea. A better approach would be to allocate a small amount of memory first, if it's not enough then allocate 2 times bigger memory, etc (this would work if you read the number on per-character basis).
Important: right after every (re)allocation, you have to check whether your malloc call succeeded (i.e. what it returns is not NULL), otherwise you cannot go any further.
What is the first getchar() for?
Instead of using gets(), you could try to read the characters one by one until you encounter something that is not a digit, at which point you can assume that the number input has finished (that is the simplest way; obviously one can process user input differently).
adding '\0' for something that was read with gets() is not needed, afaik (for something that would be read character-by-character, that would make sense).
Last but not least, you should also take care of actually freeing the allocated memory (i.e. calling free() after you are done with num). Not doing so results in a memory leak.
(Update) printf("%c",num[0]); will only print the first character of the string num. If you want to print out the whole string, you should call printf("%s",num);
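To illustrate the remarks above, here is a rough sketch that reads digits one at a time, grows the buffer by doubling, checks every allocation, and leaves freeing to the caller. The name read_digits and the initial capacity of 16 are made up for the example.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Read consecutive decimal digits from `in` into a malloc'd string.
   Stops at the first non-digit, which is pushed back onto the stream.
   Returns NULL if memory runs out; the caller must free() the result. */
char *read_digits(FILE *in)
{
    size_t cap = 16, len = 0;       /* small initial guess, doubled on demand */
    char *num = malloc(cap);
    int c;

    if (num == NULL)
        return NULL;
    while ((c = getc(in)) != EOF && isdigit(c)) {
        if (len + 1 == cap) {       /* keep room for the terminating '\0' */
            char *tmp = realloc(num, cap * 2);
            if (tmp == NULL) {
                free(num);
                return NULL;
            }
            num = tmp;
            cap *= 2;
        }
        num[len++] = (char)c;
    }
    if (c != EOF)
        ungetc(c, in);              /* leave the non-digit for the next reader */
    num[len] = '\0';                /* needed here, since we built the string by hand */
    return num;
}
```

After char *num = read_digits(stdin); the whole number can be printed with printf("%s\n", num); and released with free(num);.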
Well, there are quite a few problems with this code, none that necessarily have to do with reading big numbers. But you're still learning, so here we go. In order in which they appear in the code:
(Not really an error, but also not recommended): Casting the result of malloc is unnecessary, as outlined in this answer.
As the other answer states: allocating 500MB is probably way overkill, if you really need this much you can always add more, but you may want to start out with less (5KB, for example).
You should add a new-line at the end of your puts, or the output may end up in places where you don't expect it (i.e. much later).
(This is an error) Don't ever use gets: this page explains why.
You're checking if(num == NULL) after you've already used it (presumably to check if gets failed, but it will return NULL on failure, the num pointer itself won't be changed). You want to move this check up to right after the malloc.
After your NULL-check for num your code happily continues after the if, you'll want to add a return or exit inside the if's body.
There is a syntax error with your very last printf: you forgot the closing ].
When you decide to use fgets to get the user input, you can check whether the last character in the string is a new-line. If it isn't, the entire input didn't fit into the string, so you will need to fgets some more. When the last character is a new-line you might want to remove it (use the num[len-1]='\0'; trick, where len is strlen(num); that isn't necessary for gets, but is for fgets).
Instead of increasing the size of your buffer by just 1, you should grow it by a bit more than that: a commonly used approach is to double the current size. malloc, calloc and realloc are fairly expensive calls (performance-wise, and they may end up making system calls), and since you don't seem too fussed about memory usage, keeping these calls to a minimum can save a lot of time.
An example of these recommendations:
// needs <stdio.h>, <stdlib.h> and <string.h>
size_t bufferSize = 5000, // start with 5K
       inputLength = 0;
char *buffer = malloc(bufferSize);
if (buffer == NULL) {
    perror("No memory!");
    exit(EXIT_FAILURE);
}
buffer[0] = '\0'; // valid empty string even if nothing is read
// read into the unused tail of the buffer so earlier chunks are kept
while (fgets(buffer + inputLength, bufferSize - inputLength, stdin) != NULL) {
    inputLength += strlen(buffer + inputLength);
    if (inputLength > 0 && buffer[inputLength - 1] == '\n') {
        buffer[--inputLength] = '\0'; // strip the new-line: we're done reading
        break;
    }
    // last character was not a new-line: double the buffer in size
    bufferSize *= 2;
    char *tmp = realloc(buffer, bufferSize);
    if (tmp == NULL) {
        perror("No memory!");
        free(buffer);
        exit(EXIT_FAILURE);
    }
    // reallocating didn't fail: continue with grown buffer
    buffer = tmp;
}
Beware of bugs in the above code; I have only proved it correct, not tried it.
Finally, instead of re-inventing the wheel, you may want to take a look at the GNU Multiple Precision library which is specifically made for handling big numbers. If anything you can use it for inspiration.
This is how you could go about reading some really big numbers in. I have decided on your behalf that a 127 digit number is really big.
#include <stdio.h>
#include <stdlib.h>
#define BUFSIZE 128
int main()
{
char *num1 = malloc(BUFSIZE * sizeof (char));
if(num1==NULL){
puts("Not enough memory");
return 1;
}
char *num2 = malloc(BUFSIZE * sizeof (char));
if(num2==NULL){
puts("Not enough memory");
free(num1); /* don't leak the first buffer */
return 1;
}
puts("Please enter your first number");
fgets(num1, BUFSIZE, stdin);
puts("Please enter your second number");
fgets(num2, BUFSIZE, stdin);
printf("Your first number is: %s\n", num1);
printf("Your second number is: %s\n", num2);
free(num1);
free(num2);
return 0;
}
This should serve as a starting point for you.
What's the best way to determine the length of an input stream in Stdin so that you can create an array of the correct length to store it using, say, getchar()?
Is there some way of peeking at all the characters in the input stream and using something like:
while((ch = readchar()) != "\n" ) {
count++;
}
and then creating the array with size count?
While I was typing this code, several similar answers were posted. I am afraid you will need to do something like:
size_t size = 1;
char *input = malloc(size);           /* check for NULL in real code */
size_t count = 0;
int ch;                               /* int, so EOF can be told apart from a char */
while ((ch = getchar()) != '\n' && ch != EOF) {
    input[count++] = ch;
    if (count >= size) {
        size = size * 2;
        input = realloc(input, size); /* check for NULL in real code */
    }
}
input[count++] = 0;
input = realloc(input, count);        /* shrink to the exact length */
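Alternatively, you can get the same behaviour from the POSIX library function getline(), i.e.
ssize_t count;   /* getline() returns ssize_t, -1 on failure or EOF */
size_t size = 0; /* must be initialised when input is NULL */
char *input = NULL;
count = getline(&input, &size, stdin);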
In both cases, do not forget to free input once you have finished with it.
Generally there is no way. You can peek only one character ahead, so if you use code like in your example, the characters are read already, and even if you know their count and can allocate the memory, you cannot read them again.
The possible strategy is to allocate some memory at the beginning, and then in the loop if you are hitting the limit reallocate the memory, doubling the length.
One way to do this with typical Unix files is to use fseek (followed by ftell) to determine the size of the file. Unfortunately, stdin is often not a seekable stream.
The only way to handle the general case that I know of is to simply use dynamic memory allocation. You make a best guess with an initial buffer and then, once you reach its end, you allocate a bigger array (or realloc the old one), copy the data over, and carry on. Mistakes in handling this process are the start of many classic security bugs.
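As a sketch of the seek-based idea, the helper below (the name remaining_bytes is made up) reports how much data is left in a stream, or -1 when the stream cannot be seeked, in which case you would fall back to a growing buffer. It assumes a platform such as POSIX where ftell on a file counts bytes; the C standard does not guarantee that for every stream.

```c
#include <stdio.h>

/* Return the number of bytes between the current position and the end
   of the stream, or -1 if the stream is not seekable (pipes, terminals).
   The read position is restored before returning. */
long remaining_bytes(FILE *fp)
{
    long start = ftell(fp);
    if (start < 0)
        return -1;
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    long end = ftell(fp);
    if (fseek(fp, start, SEEK_SET) != 0)    /* go back to where we were */
        return -1;
    return end < 0 ? -1 : end - start;
}
```

When the program is run as ./prog < file.txt this typically succeeds on stdin; when input comes from a pipe or the keyboard it returns -1.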
I am trying to read all strings from a text file and save them in an array of strings.
However, when I try to print the contents of my string array only the last part is printed.
Why does my code only copy the last string?
My code
# include <stdio.h>
# define BUFFERSIZE 100
int main(int argc, char *argv[]){
char buffer[BUFFERSIZE];
int i = 0;
char *text[BUFFERSIZE];
while(fgets(buffer, BUFFERSIZE, stdin) != NULL){
text[i] = buffer;
i++;
}
int j = 0;
for (j=0; j<sizeof(text)/sizeof(char); i++){
printf("%s\n", text[j]);
}
return 0;
}
My text file
ESD can create spectacular electric sparks (thunder and lightning is a large-scale ESD event), but also less dramatic forms which may be neither seen nor heard, yet still be large enough to cause damage to sensitive electronic devices.
Electric sparks require a field strength above about 4 kV/cm in air, as notably occurs in lightning strikes. Other forms of ESD include corona discharge from sharp electrodes and brush discharge from blunt electrodes.
Output
>>> make 1_17; ./1_17 < example.txt
m blunt electrodes.
m blunt electrodes.
m blunt electrodes.
m blunt electrodes.
...
There are two issues. The first is that for all values of i, text[i] contains the same buffer, which you've reused multiple times. The second is that in your printing code, you're only printing text[0].
Using the same buffer
There's only one buffer declared,
char buffer[BUFFERSIZE];
and while you modify its contents many times in the loop, it's always the same buffer (i.e., the same storage area in memory), so
text[i] = buffer;
makes every element of text contain the (address of) the same buffer. You need to copy the contents of buffer into a new string and store that in text[i] instead. You can duplicate the string using, e.g., strdup if you can use POSIX functions, as in
text[i] = strdup(buffer);
strdup uses malloc to allocate the space for the string, so if you're using this in a larger application, be sure to free those strings later. In your simple application, though, they'll be freed when the application exits, so you're not in too much trouble.
If you can only use standard C functions, you'll probably want strcpy, which will make you do a little bit more work. (You'll need to allocate a string big enough to hold the current contents of buffer, and then copy buffer's contents into it. You'll still want to free them afterwards.)
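A standard-C stand-in for strdup might look like the following. The name dup_string is only illustrative.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate a new string just big enough for `s` and copy `s` into it.
   Returns NULL if malloc fails; the caller frees the copy. */
char *dup_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);     /* +1 for the terminating '\0' */
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}
```

In the loop this would be used as text[i] = dup_string(buffer);, so that each element owns its own storage.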
Printing only text[0]
However, you've also got an issue with your printing code. You're indexing into text with j, but never modifying j (you're incrementing i with i++), so you're always printing the same string (which is actually the first in the array, not the last, but its contents are the same as the last string that you read from the file):
int j = 0;
for (j=0; j<sizeof(text)/sizeof(char); i++){
printf("%s\n", text[j]);
}
After the first loop, i is the number of strings that you've got, so you probably just want:
int j;
for ( j=0; j < i; j++ ) {
printf("%s\n", text[j]);
}
Also, in addition to the answers already posted, you're incrementing i in your for loop, not j.
Look at this loop:
while(fgets(buffer, BUFFERSIZE, stdin) != NULL){
text[i] = buffer;
i++;
}
You're setting every element of your array to the exact same value: the value of buffer, which is a pointer to a piece of memory. On every iteration you're reading data over the top of where you read it last time.
To fix this, you'd need to allocate a new buffer on each iteration, and set the array value to a pointer to that new buffer. That way each element of the array will point to a different section of memory.
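Put together, the reading loop could be sketched as below. It is written as a function (read_lines, a name chosen for this example) that takes the stream and the array as parameters, and it also guards against overrunning the array, which the original loop does not.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFFERSIZE 100

/* Read up to `max` chunks of at most BUFFERSIZE-1 characters from `in`,
   storing each one in its own freshly allocated buffer.
   Returns the number of entries filled in; the caller frees each one. */
size_t read_lines(FILE *in, char *text[], size_t max)
{
    char buffer[BUFFERSIZE];        /* scratch space, reused on every iteration */
    size_t i = 0;

    while (i < max && fgets(buffer, BUFFERSIZE, in) != NULL) {
        text[i] = malloc(strlen(buffer) + 1);   /* a new buffer per entry */
        if (text[i] == NULL)
            break;                  /* out of memory: stop reading */
        strcpy(text[i], buffer);
        i++;
    }
    return i;
}
```

As in the original code, a line longer than BUFFERSIZE-1 characters ends up split across several entries.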
I am working on a machine learning application where my features are stored in huge text files. Currently, the way I have implemented the data reads is way too slow to be practical. Basically, each line of the text file represents a feature vector in sparse format. For instance, the following example contains three feature vectors in index:value fashion.
1:0.34 2:0.67 6:0.99 12:2.1 28:2.1
2:0.12 22:0.27 26:9.8 69:1.8
3:0.24 4:67.0 7:1.9 13:8.1 18:1.7 32:3.4
The following is how I am doing the reads now. As I don't know the length of the feature string beforehand, I just read into a suitably large buffer which upper-bounds the length of each line. Once I have read the line from the file, I use the strtok_r function to split the string into key-value pairs and then process it further to store it as a sparse array. Any ideas on how to speed this up are highly appreciated.
FILE *fp = fopen(feature_file, "r");
int fvec_length = 0;
char line[1000000];
size_t ln;
char *pair, *single, *brkt, *brkb;
SVECTOR **fvecs = (SVECTOR **)malloc(n_fvecs*sizeof(SVECTOR *));
if(!fvecs) die("Memory Error.");
int j = 0;
while( fgets(line,1000000,fp) ) {
ln = strlen(line) - 1;
if (line[ln] == '\n')
line[ln] = '\0';
fvec_length = 0;
for(pair = strtok_r(line, " ", &brkt); pair; pair = strtok_r(NULL, " ", &brkt)){
fvec_length++;
words = (WORD *) realloc(words, fvec_length*sizeof(WORD));
if(!words) die("Memory error.");
j = 0;
for (single = strtok_r(pair, ":", &brkb); single; single = strtok_r(NULL, ":", &brkb)){
if(j == 0){
words[fvec_length-1].wnum = atoi(single);
}
else{
words[fvec_length-1].weight = atof(single);
}
j++;
}
}
fvec_length++;
words = (WORD *) realloc(words, fvec_length*sizeof(WORD));
if(!words) die("Memory error.");
words[fvec_length-1].wnum = 0;
words[fvec_length-1].weight = 0.0;
fvecs[i] = create_svector(words,"",1);
free(words);
words = NULL;
}
fclose(fp);
return fvecs;
You should absolutely reduce the number of memory allocations. The classic approach is to double the vector on each allocation, so that you get logarithmic number of allocation calls rather than linear.
Since your line pattern seems constant, there's no need to tokenize it by hand, use a single sscanf() on each loaded line to directly scan into that line's words.
Your line buffer seems extremely large; a 1 MB array on the stack risks overflowing it and also hurts cache locality a bit.
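A rough sketch combining the first two suggestions. Since the number of pairs varies from line to line, it calls sscanf() once per pair and uses %n to advance through the line, rather than making one call per line. The Pair struct is a hypothetical stand-in for the question's WORD type, and parse_line is a made-up name.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the question's WORD type. */
typedef struct {
    int wnum;
    double weight;
} Pair;

/* Parse "index:value" pairs from `line` into *out, doubling the array's
   capacity (*cap) whenever it fills up. Returns the number of pairs parsed.
   *out and *cap can be reused across lines, so most lines need no allocation. */
size_t parse_line(const char *line, Pair **out, size_t *cap)
{
    size_t n = 0;
    int used;                       /* characters consumed, reported by %n */
    Pair p;

    while (sscanf(line, " %d:%lf%n", &p.wnum, &p.weight, &used) == 2) {
        if (n == *cap) {
            size_t new_cap = *cap ? *cap * 2 : 8;
            Pair *tmp = realloc(*out, new_cap * sizeof *tmp);
            if (tmp == NULL)
                return n;           /* keep what we have so far */
            *out = tmp;
            *cap = new_cap;
        }
        (*out)[n++] = p;
        line += used;               /* continue after the pair just parsed */
    }
    return n;
}
```

Start with Pair *pairs = NULL; size_t cap = 0; and pass &pairs and &cap for every line; the array then only grows until the longest vector has been seen.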
Probably, when you call realloc, you are making a system call. A system call is an expensive operation that involves a context switch from user space to kernel space and back.
It seems that you are making a realloc call for each token pair you read. That is a lot of calls. You didn't mind allocating 1 MB up front for the line buffer, so why be so conservative about the buffer pointed to by words?
I find that on Linux (Fedora) realloc() is extremely efficient and does not slow things down, particularly. On Windows, because of how the memory is structured, it can be disastrous.
My solution to the "lines of unknown length" problem is to write a function which makes multiple calls to fgets(), concatenating the results until a newline character is detected. The function accepts &maxlinelength as an argument, and if any call to fgets() would cause the concatenated string to exceed maxlinelength, then maxlinelength is adjusted. This way new memory is reallocated only until the longest line is found. Similarly, you would only need to realloc() for WORD if maxlinelength had been adjusted.
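The answer does not show its function, so the following is only one guess at how it might look. The name read_long_line, the initial capacity of 128 and the doubling growth policy are all assumptions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of any length into *buf using repeated fgets() calls.
   *maxlen is the current capacity of *buf and is raised only when a line
   does not fit. Returns *buf, or NULL at end of input or if memory runs out. */
char *read_long_line(FILE *in, char **buf, size_t *maxlen)
{
    size_t len = 0;

    if (*buf == NULL || *maxlen < 2) {      /* first call: make a buffer */
        char *tmp = realloc(*buf, 128);
        if (tmp == NULL)
            return NULL;
        *buf = tmp;
        *maxlen = 128;
    }
    while (fgets(*buf + len, (int)(*maxlen - len), in) != NULL) {
        len += strlen(*buf + len);
        if (len > 0 && (*buf)[len - 1] == '\n')
            return *buf;                    /* complete line */
        if (len + 1 < *maxlen)
            return *buf;                    /* room left: last line had no new-line */
        char *tmp = realloc(*buf, *maxlen * 2);     /* full: grow and continue */
        if (tmp == NULL)
            return NULL;
        *buf = tmp;
        *maxlen *= 2;
    }
    return len > 0 ? *buf : NULL;
}
```

Start with char *line = NULL; size_t maxlinelength = 0; and call read_long_line(fp, &line, &maxlinelength) in a loop. Once the longest line has been seen, no further reallocation takes place.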
I find fwrite fails when I am trying to write somewhat big data as in the following code.
#include <stdio.h>
#include <unistd.h>
int main(int argc, char* argv[])
{
int size = atoi(argv[1]);
printf("%d\n", size);
FILE* fp = fopen("test", "wb");
char* c = "";
int i = fwrite(c, size, 1, fp);
fclose(fp);
printf("%d\n", i);
return 0;
}
The code is compiled into binary tw
When I try ./tw 10000 it works well. But when I try something like ./tw 12000 it fails.(fwrite() returns 0 instead of 1)
What's the reason of that? In what way can I avoid this?
EDIT: When I do fwrite(c, 1, size, fp) it returns 8192 instead of larger size I give.
2nd EDIT: When I write a loop that runs for size times, and fwrite(c, 1, 1, fp) each time, it work perfectly OK.
It seems when size is too large(as in the first EDIT) it only writes about 8192 bytes.
I guess something has limited fwrite write up to fixed size bytes at a time.
3rd EDIT: The above is not clear.
The following fails (space - w_result != 0) when space is large, where space is chosen by me and w_result is the total number of objects written.
w_result = 0;
char* empty = malloc(BLOCKSIZE * sizeof(char));
w_result = fwrite(empty, BLOCKSIZE, space, fp);
printf("%d lost\n", space - w_result);
While this works OK.
w_result = 0;
char* empty = malloc(BLOCKSIZE * sizeof(char));
for(i = 0; i < space; i ++)
w_result += fwrite(empty, BLOCKSIZE, 1, fp);
printf("%d lost\n", space - w_result);
(every variable has been declared.)
I corrected some errors the answers mentioned. But the first one should work according to you.
With fwrite(c, size, 1, fp); you state that fwrite should write 1 item that is size bytes big, out of the buffer c.
c is just a pointer to an empty string. It has a size of 1. When you tell fwrite to go look for more data than 1 byte in c , you get undefined behavior. You cannot fwrite more than 1 byte from c.
(undefined behavior means anything could happen, it could appear to work fine when you try with a size of 10000 and not with a size of 12000. The implementation dependent reason for that is likely that there is some memory available, perhaps the stack, starting at c and 10000 bytes forward, but at e.g. 11000 there is no memory and you get a segfault)
You are reading memory that doesn't belong to your program (and writing it to a file).
Test your program using valgrind to see the errors.
From that snippet of code, it looks like you're trying to write what's at c, which is just a single NUL byte, to the file pointer, and you're doing so "size" times. The fact that it doesn't crash with 10000 is coincidental. What are you trying to do?
As has been stated by others the code is performing an invalid memory read via c.
A possible solution would be to dynamically allocate a buffer that is size bytes in size, initialise it, and fwrite() it to the file, remembering to deallocate the buffer afterwards.
Remember to check return values from functions (fopen() for example).
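A small sketch of that suggestion, assuming that zero bytes are an acceptable fill value. The helper name write_zeros is made up for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Write `size` zero bytes to `fp` from a buffer that really is `size`
   bytes long. Returns the number of bytes actually written (0 on
   allocation failure), so the caller can compare it with `size`. */
size_t write_zeros(FILE *fp, size_t size)
{
    if (size == 0)
        return 0;
    char *buf = calloc(size, 1);    /* allocated and initialised to zero */
    if (buf == NULL)
        return 0;
    size_t written = fwrite(buf, 1, size, fp);
    free(buf);                      /* deallocate once the data is handed to stdio */
    return written;
}
```

In the original program this would replace the fwrite(c, size, 1, fp) call, after checking that fopen() did not return NULL.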