Possible alternatives to speed up reads from a text file in C?

I am working on a machine learning application where my features are stored in huge text files. The way I currently read the input data is far too slow to be practical. Each line of the text file represents one feature vector in sparse format. For instance, the following example contains three feature vectors in index:value fashion.
1:0.34 2:0.67 6:0.99 12:2.1 28:2.1
2:0.12 22:0.27 26:9.8 69:1.8
3:0.24 4:67.0 7:1.9 13:8.1 18:1.7 32:3.4
The following is how I am reading the data now. As I don't know the length of each line beforehand, I read into a buffer large enough to bound the length of any line. Once I have read a line from the file, I use the strtok_r function to split it into key:value pairs and then process them further to store them as a sparse array. Any ideas on how to speed this up are highly appreciated.
FILE *fp = fopen(feature_file, "r");
int fvec_length = 0;
char line[1000000];
size_t ln;
char *pair, *single, *brkt, *brkb;
SVECTOR **fvecs = (SVECTOR **)malloc(n_fvecs*sizeof(SVECTOR *));
if(!fvecs) die("Memory Error.");
int j = 0;
while( fgets(line,1000000,fp) ) {
ln = strlen(line) - 1;
if (line[ln] == '\n')
line[ln] = '\0';
fvec_length = 0;
for(pair = strtok_r(line, " ", &brkt); pair; pair = strtok_r(NULL, " ", &brkt)){
fvec_length++;
words = (WORD *) realloc(words, fvec_length*sizeof(WORD));
if(!words) die("Memory error.");
j = 0;
for (single = strtok_r(pair, ":", &brkb); single; single = strtok_r(NULL, ":", &brkb)){
if(j == 0){
words[fvec_length-1].wnum = atoi(single);
}
else{
words[fvec_length-1].weight = atof(single);
}
j++;
}
}
fvec_length++;
words = (WORD *) realloc(words, fvec_length*sizeof(WORD));
if(!words) die("Memory error.");
words[fvec_length-1].wnum = 0;
words[fvec_length-1].weight = 0.0;
fvecs[i] = create_svector(words,"",1);
free(words);
words = NULL;
}
fclose(fp);
return fvecs;

You should absolutely reduce the number of memory allocations. The classic approach is to double the vector's capacity each time it fills up, so that you get a logarithmic number of allocation calls rather than a linear one.
Since your line pattern seems constant, there's no need to tokenize it by hand; use a single sscanf() on each loaded line to scan directly into that line's words.
Your line buffer seems extremely large. That can be costly: it risks blowing up the stack and worsens cache locality a bit.
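As a rough sketch of the capacity-doubling advice above, the function below parses one line of index:value pairs and grows its array geometrically. The Pair type and parse_sparse_line name are made up for illustration (Pair stands in for the question's WORD), and the sketch uses strtol/strtod rather than the sscanf mentioned above, because the number of pairs per line varies.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the question's WORD type. */
typedef struct {
    int    index;
    double value;
} Pair;

/*
 * Parse one line of "index:value" pairs separated by whitespace.
 * Returns a malloc'ed array (caller frees) and stores the number of
 * pairs in *count. Returns NULL only if an allocation fails.
 */
Pair *parse_sparse_line(const char *line, size_t *count)
{
    size_t cap = 16, n = 0;
    *count = 0;
    Pair *pairs = malloc(cap * sizeof *pairs);
    if (!pairs) return NULL;

    const char *p = line;
    for (;;) {
        char *end;
        long idx = strtol(p, &end, 10);      /* skips leading whitespace */
        if (end == p || *end != ':') break;  /* no further "index:" token */
        p = end + 1;
        double val = strtod(p, &end);
        if (end == p) break;                 /* malformed value: stop */
        p = end;

        if (n == cap) {                      /* full: double the capacity */
            cap *= 2;
            Pair *tmp = realloc(pairs, cap * sizeof *pairs);
            if (!tmp) { free(pairs); return NULL; }
            pairs = tmp;
        }
        pairs[n].index = (int)idx;
        pairs[n].value = val;
        n++;
    }
    *count = n;
    return pairs;
}
```

With doubling, a line holding a thousand pairs costs about seven reallocations instead of a thousand. Keeping one such array alive across lines and only resetting n would reduce the allocation count further.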

Calling realloc can end up making a system call whenever the allocator needs more memory from the kernel. A system call is an expensive operation that involves a switch from user space to kernel space and back.
It seems that you call realloc for every token pair you parse. That is a lot of calls. You did not mind allocating 1 MB up front for the buffer pointed to by line, so why be so conservative about the buffer pointed to by words?

I find that on Linux (Fedora) realloc() is extremely efficient and does not particularly slow things down. On Windows, because of how the memory is structured, it can be disastrous.
My solution to the "lines of unknown length" problem is to write a function which makes multiple calls to fgets(), concatenating the results until a newline character is detected. The function accepts &maxlinelength as an argument, and if any call to fgets() would cause the concatenated string to exceed maxlinelength, then maxlinelength is increased. This way memory is reallocated only until the longest line has been seen. Similarly, you would only need to realloc() for WORD if maxlinelength had been adjusted.
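One possible shape for such a function is sketched below. The name read_long_line and the buf/cap parameters are illustrative rather than taken from the answer; cap plays the role of maxlinelength, and the buffer is reused between calls so it only grows when a longer line turns up.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Read one full line from fp into *buf, growing *buf (and *cap) only
 * when the current line does not fit. Returns the line length without
 * the trailing newline, or -1 on end-of-file or allocation failure.
 * Start with *buf == NULL and *cap == 0; the caller frees *buf.
 */
long read_long_line(FILE *fp, char **buf, size_t *cap)
{
    if (*buf == NULL || *cap < 2) {
        size_t initial = 128;
        char *tmp = realloc(*buf, initial);
        if (!tmp) return -1;
        *buf = tmp;
        *cap = initial;
    }
    size_t len = 0;
    while (fgets(*buf + len, (int)(*cap - len), fp)) {
        len += strlen(*buf + len);
        if (len > 0 && (*buf)[len - 1] == '\n') {
            (*buf)[--len] = '\0';          /* strip the newline */
            return (long)len;
        }
        if (len + 1 < *cap)
            return (long)len;              /* last line, no newline */
        /* buffer is full: double it and keep reading the same line */
        char *tmp = realloc(*buf, *cap * 2);
        if (!tmp) return -1;
        *buf = tmp;
        *cap *= 2;
    }
    return len > 0 ? (long)len : -1;
}
```

The cast to int assumes lines stay well below INT_MAX bytes, which fgets requires anyway.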

Related

How to read in the entire word, and not just the first character?

I am writing a method in C in which I have a list of words from a file that I am redirecting from stdin. However, when I attempt to read in the words into the array, my code will only output the first character. I understand that this is because of a casting issue with char and char *.
While I am challenging myself to not use any of the functions from string.h, I have tried iterating through and am thinking of writing my own strcpy function, but I am confused because my input is coming from a file that I am redirecting from standard input. The variable numwords is inputted by the user in the main method (not shown).
I am trying to debug this issue via dumpwptrs to show me what the output is. I am not sure what in the code is causing me to get the wrong output - whether it is how I read in words to the chunk array, or if I am pointing to it incorrectly with wptrs?
//A huge chunk of memory that stores the null-terminated words contiguously
char chunk[MEMSIZE];
//Points to words that reside inside of chunk
char *wptrs[MAX_WORDS];
/** Total number of words in the dictionary */
int numwords;
.
.
.
void readwords()
{
//Read in words and store them in chunk array
for (int i = 0; i < numwords; i++) {
//When you use scanf with '%s', it will read until it hits
//a whitespace
scanf("%s", &chunk[i]);
//Each entry in wptrs array should point to the next word
//stored in chunk
wptrs[i] = &chunk[i]; //Assign address of entry
}
}
Do not re-use the part of char chunk[MEMSIZE]; that already holds prior words.
Instead, use the next unused memory.
char chunk[MEMSIZE];
char *pool = chunk; // location of unassigned memory pool
// scanf("%s", &chunk[i]);
// wptrs[i] = &chunk[i];
scanf("%s", pool);
wptrs[i] = pool;
pool += strlen(pool) + 1; // Beginning of next unassigned memory
Robust code would check the return value of scanf() and ensure that i and the pool pointer do not exceed their limits.
I'd go for an fgets() solution as long as words are entered one per line.
char chunk[MEMSIZE];
char *pool = chunk;
// return word count
int readwords2() {
int word_count;
// limit words to MAX_WORDS
for (word_count = 0; word_count < MAX_WORDS; word_count++) {
intptr_t remaining = &chunk[MEMSIZE] - pool;
if (remaining < 2) {
break; // out of useful pool memory
}
if (fgets(pool, remaining, stdin) == NULL) {
break; // end-of-file/error
}
pool[strcspn(pool, "\n")] = '\0'; // lop off potential \n
wptrs[word_count] = pool;
pool += strlen(pool) + 1;
}
return word_count;
}
While I am challenging myself to not use any of the functions from string.h, ...
The best way to challenge yourself to not use any of the functions from string.h is to write them yourself and then use them.
Your program reads the next word into the i-th position of the buffer chunk, so each read overwrites everything after the first character of the previous word with the word just read. As a result you keep only the first letter of each word (as long as i stays below the size of chunk). You then make every pointer in wptrs point to these positions, which makes it impossible to tell where one string ends and the next begins: all the null terminators except the last have been overwritten. The first string therefore consists of the first letters of all your words except the last one, which is complete. The second is the same string starting at its second character, the third starts at the third character, and so on.
Build your own version of strdup(3) and use chunk to store the string temporarily. Then make a dynamically allocated copy of the string with your version of strdup(3) and make the pointer point to it.
Finally, when you are finished, just free all the allocated strings and voilà!
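A hand-rolled strdup along those lines might look like the sketch below. The my_ prefixed names are invented for illustration, and nothing from string.h is used, in keeping with the question's self-imposed rule.

```c
#include <stdlib.h>

/* Count characters up to, but not including, the terminating '\0'. */
size_t my_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}

/* Return a malloc'ed copy of s, or NULL if allocation fails. */
char *my_strdup(const char *s)
{
    size_t n = my_strlen(s) + 1;          /* +1 for the '\0' */
    char *copy = malloc(n);
    if (!copy) return NULL;
    for (size_t i = 0; i < n; i++)        /* copies the '\0' too */
        copy[i] = s[i];
    return copy;
}
```

In readwords() this would be used roughly as scanf("%s", chunk); wptrs[i] = my_strdup(chunk); with a free(wptrs[i]) for every word once the program is done with it.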
Also, this is very important: read How to create a Minimal, Complete, and Verifiable example. Very often the code you post no longer contains the error, because you removed it while trimming the code down (you don't normally know where the error is, or you would have corrected it and there would be no question here, right?).

Gradual memory allocation strategy

For didactic purposes, I am working on a program that reads a string (array of chars) from standard input. The goal is to allow the program to sequentially increase the memory allocated according to the dimension of the input. I would like your opinion on my approach.
I thought I could allocate one byte of space one by one, for every reading cycle needed. Clearly, it does not work. How could I approach this problem? Is it even worth trying?
Thank you for your patience and support!
#include<stdio.h>
#include<stdlib.h>
#include<ctype.h>
int main(){
char *q;
int flag = 1, j = 0;
printf("\n\nNow let's go for a word of undefined lenght. Type it:\n\n");
do{
q++ = calloc(1, sizeof(char));
flag = ((q[j] = getchar()) != 0); #Until it is valid.
++j;
}while(flag);
return 0;
}
Reading the entire input at once is typically wrong.
The standard approach is to get a buffer of a few kilobytes and read and process the data in chunks of that size, overwriting previously read data that is no longer needed.
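As a small illustration of chunked processing, the sketch below counts newlines in a stream through a fixed 4 KB buffer that is overwritten on every read. The count_lines name and the buffer size are arbitrary choices for the example.

```c
#include <stdio.h>

/* Count newline characters in fp using a fixed-size, reused buffer. */
size_t count_lines(FILE *fp)
{
    char buf[4096];
    size_t lines = 0, got;

    /* Each fread overwrites the previous chunk, which is already processed. */
    while ((got = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < got; i++)
            if (buf[i] == '\n')
                lines++;
    }
    return lines;
}
```

Memory use stays constant no matter how large the input is, which is the point of the approach.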
In the rare cases where you need to have the entire thing in RAM and are reading from a regular file, you can fstat the file to get its size and allocate accordingly. If the file is big (megabytes in size), you should mmap it.
Finally, in the extremely rare case where you need to read everything and can't know the size in advance, the way to go is to realloc, doubling the size each time, i.e. size *= 2; new = realloc(p, size); p = new; ... (checking new for NULL before assigning it).
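Applied to the question's goal of reading a word of unknown length, the doubling strategy could be sketched as follows. The read_word name and the starting size of 16 are assumptions made for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Read characters from fp until newline or end-of-file, doubling the
 * buffer whenever it fills up. Returns a malloc'ed, NUL-terminated
 * string (caller frees), or NULL if an allocation fails.
 */
char *read_word(FILE *fp)
{
    size_t size = 16, len = 0;
    char *p = malloc(size);
    if (!p) return NULL;

    int c;                                 /* int, so EOF is representable */
    while ((c = fgetc(fp)) != EOF && c != '\n') {
        if (len + 1 == size) {             /* keep one byte for the '\0' */
            size *= 2;
            char *tmp = realloc(p, size);
            if (!tmp) { free(p); return NULL; }
            p = tmp;
        }
        p[len++] = (char)c;
    }
    p[len] = '\0';
    return p;
}
```

Called as char *word = read_word(stdin);, this replaces the one-byte-per-character calloc loop with a handful of reallocations.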

Reading numbers

Huge thanks to everyone that answered. I have realised that I suck a lot at this. I will take every answer into consideration, and hopefully I will manage to compile something that works.
Some remarks:
Allocating 500 MB just in case doesn't seem like a good idea. A better approach would be to allocate a small amount of memory first and, if it's not enough, allocate twice as much, and so on (this would work if you read the number character by character).
Important: right after every (re)allocation, you have to check whether the call succeeded (i.e. that what it returns is not NULL); otherwise you cannot go any further.
What is the first getchar() for?
Instead of using gets(), you could try to read the characters one by one until you encounter something that is not a digit, at which point you can assume that the number has ended (that is the simplest way; obviously one can process user input differently).
Adding '\0' to something that was read with gets() is not needed, as far as I know (for something read character by character, it would make sense).
Last but not least, you should also take care of actually freeing the allocated memory (i.e. calling free() after you are done with num). Not doing so results in a memory leak.
(Update) printf("%c",num[0]); will only print the first character of the string num. If you want to print out the whole string, you should call printf("%s",num);
Well, there are quite a few problems with this code, none that necessarily have to do with reading big numbers. But you're still learning, so here we go. In order in which they appear in the code:
(Not really an error, but also not recommended): Casting the result of malloc is unnecessary, as outlined in this answer.
As the other answer states: allocating 500MB is probably way overkill, if you really need this much you can always add more, but you may want to start out with less (5KB, for example).
You should add a new-line at the end of your puts, or the output may end up in places where you don't expect it (i.e. much later).
(This is an error) Don't ever use gets: this page explains why.
You're checking if(num == NULL) after you've already used it (presumably to check if gets failed, but it will return NULL on failure, the num pointer itself won't be changed). You want to move this check up to right after the malloc.
After your NULL-check for num your code happily continues after the if, you'll want to add a return or exit inside the if's body.
There is a syntax error with your very last printf: you forgot the closing ].
When you decide to use fgets to get the user input, you can check whether the last character in the string is a new-line. If it isn't, the entire input did not fit into the string, so you will need to call fgets some more. When the last character is a new-line you might want to remove it (with num[len-1]='\0';, a trick that isn't necessary for gets but is for fgets).
Instead of increasing the size of your buffer by just 1, you should grow it by a bit more than that: a commonly used approach is to double the current size. malloc, calloc and realloc are fairly expensive calls (performance-wise, since they may have to request memory from the operating system), and since you don't seem too fussed about memory usage, keeping these calls to a minimum can save a lot of time.
An example of these recommendations:
size_t bufferSize = 5000, // start with 5K
       inputLength = 0;
char * buffer = malloc(bufferSize);
if(buffer == NULL){
    perror("No memory!");
    exit(-1);
}
// append each chunk after what has already been read
while(fgets(buffer + inputLength, (int)(bufferSize - inputLength), stdin) != NULL){
    inputLength += strlen(buffer + inputLength);
    if(inputLength > 0 && buffer[inputLength - 1] == '\n'){
        break; // last character was a new-line: we're done reading
    }
    // last character was not a new-line: the input did not fit yet
    bufferSize *= 2; // double the buffer in size
    char * tmp = realloc(buffer, bufferSize);
    if(tmp == NULL){
        perror("No memory!");
        free(buffer);
        exit(-1);
    }
    // reallocating didn't fail: continue with grown buffer
    buffer = tmp;
}
Beware of bugs in the above code; I have only proved it correct, not tried it.
Finally, instead of re-inventing the wheel, you may want to take a look at the GNU Multiple Precision library which is specifically made for handling big numbers. If anything you can use it for inspiration.
This is how you could go about reading some really big numbers in. I have decided on your behalf that a 127 digit number is really big.
#include <stdio.h>
#include <stdlib.h>
#define BUFSIZE 128
int main()
{
int n, number, len;
char *num1 = malloc(BUFSIZE * sizeof (char));
if(num1==NULL){
puts("Not enough memory");
return 1;
}
char *num2 = malloc(BUFSIZE * sizeof (char));
if(num2==NULL){
puts("Not enough memory");
return 1;
}
puts("Please enter your first number");
fgets(num1, BUFSIZE, stdin);
puts("Please enter your second number");
fgets(num2, BUFSIZE, stdin);
printf("Your first number is: %s\n", num1);
printf("Your second number is: %s\n", num2);
free(num1);
free(num2);
return 0;
}
This should serve as a starting point for you.

Determining necessary array length to store input

What's the best way to determine the length of an input stream on stdin so that you can create an array of the correct length to store it using, say, getchar()?
Is there some way of peeking at all the characters in the input stream and using something like:
while((ch = readchar()) != "\n" ) {
count++;
}
and then creating the array with size count?
While I was typing this code, several similar answers were posted. I am afraid you will need to do something like:
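int size = 1;
char *input = malloc(size);            // check for NULL in real code
int count = 0;
int ch;                                // int, so EOF can be told apart from a char
while((ch = getchar()) != '\n' && ch != EOF) {
    input[count++] = ch;
    if (count >= size) {
        size = size * 2;
        input = realloc(input, size);  // check for NULL in real code
    }
}
input[count++] = 0;                    // terminate the string
input = realloc(input, count);         // shrink to the size actually used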
Alternatively, you can let the POSIX library function getline() do the same for you, i.e.
ssize_t count;
size_t size = 0;
char *input = NULL;
count = getline(&input, &size, stdin);
In both cases, do not forget to free input once you have finished with it.
Generally there is no way. You can peek only one character ahead, so if you use code like in your example, the characters are read already, and even if you know their count and can allocate the memory, you cannot read them again.
The possible strategy is to allocate some memory at the beginning, and then in the loop if you are hitting the limit reallocate the memory, doubling the length.
One way to do this with typical Unix files is to use fseek (followed by ftell) to determine the size of the file. Unfortunately, stdin is often not a seekable stream.
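A minimal sketch of that idea follows, assuming a POSIX-like system where seeking to the end of a regular file yields its size in bytes. The stream_size name is made up for the example, and a return value of -1 signals that the stream is not seekable (a pipe or terminal, for instance).

```c
#include <stdio.h>

/*
 * Return the total size of the file behind fp in bytes, or -1 if the
 * stream cannot be seeked. The current read position is restored.
 */
long stream_size(FILE *fp)
{
    long start = ftell(fp);
    if (start < 0) return -1;                    /* not seekable */
    if (fseek(fp, 0, SEEK_END) != 0) return -1;
    long end = ftell(fp);
    if (fseek(fp, start, SEEK_SET) != 0) return -1;
    return end;
}
```

When this returns -1, the only option left is to grow a buffer while reading.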
The only way I know of to handle the general case is to simply use dynamic memory allocation. You make a best guess for the initial buffer, and once you reach its end, you allocate a larger array, copy the data over and carry on. Mistakes in handling this process are the source of many classic security bugs.

Why do I always get the last elements of my string array? [duplicate]

I am trying to read all strings from a text file and save them in an array of strings.
However, when I try to print the contents of my string array only the last part is printed.
Why does my code only copy the last string?
My code
# include <stdio.h>
# define BUFFERSIZE 100
int main(int argc, char *argv[]){
char buffer[BUFFERSIZE];
int i = 0;
char *text[BUFFERSIZE];
while(fgets(buffer, BUFFERSIZE, stdin) != NULL){
text[i] = buffer;
i++;
}
int j = 0;
for (j=0; j<sizeof(text)/sizeof(char); i++){
printf("%s\n", text[j]);
}
return 0;
}
My text file
ESD can create spectacular electric sparks (thunder and lightning is a large-scale ESD event), but also less dramatic forms which may be neither seen nor heard, yet still be large enough to cause damage to sensitive electronic devices.
Electric sparks require a field strength above about 4 kV/cm in air, as notably occurs in lightning strikes. Other forms of ESD include corona discharge from sharp electrodes and brush discharge from blunt electrodes.
Output
>>> make 1_17; ./1_17 < example.txt
m blunt electrodes.
m blunt electrodes.
m blunt electrodes.
m blunt electrodes.
...
There are two issues. The first is that, for every i, text[i] holds the same buffer, which you have reused multiple times. The second is that your printing code only ever prints text[0].
Using the same buffer
There's only one buffer declared,
char buffer[BUFFERSIZE];
and while you modify its contents many times in the loop, it's always the same buffer (i.e., the same storage area in memory), so
text[i] = buffer;
makes every element of text contain the address of the same buffer. You need to copy the contents of buffer into a new string and store that in text[i] instead. You can duplicate the string using, e.g., strdup if you can use POSIX functions, as in
text[i] = strdup(buffer);
strdup uses malloc to allocate the space for the string, so if you're using this in a larger application, be sure to free those strings later. In your simple application, though, they'll be freed when the application exits, so you're not in too much trouble.
If you can only use standard C functions, you'll probably want strcpy, which makes you do a little more work: you'll need to allocate a string big enough to hold the current contents of buffer, and then copy buffer's contents into it. You'll still want to free the copies afterwards.
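The malloc-plus-strcpy route could be wrapped in a small helper like the one sketched here. The copy_string name is hypothetical; it does what strdup does using only standard C.

```c
#include <stdlib.h>
#include <string.h>

/* Copy s into freshly allocated memory; returns NULL if malloc fails. */
char *copy_string(const char *s)
{
    char *copy = malloc(strlen(s) + 1);   /* +1 for the terminating '\0' */
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}
```

In the reading loop this becomes text[i] = copy_string(buffer);, and each text[i] should be passed to free() once it has been printed.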
Printing only text[0]
However, you've also got an issue with your printing code. You're indexing into text with j, but never modifying j (you're incrementing i with i++), so you're always printing the same string (which is actually the first in the array, not the last, but its contents are the same as the last string that you read from the file):
int j = 0;
for (j=0; j<sizeof(text)/sizeof(char); i++){
printf("%s\n", text[j]);
}
After the first loop, i is the number of strings that you've got, so you probably just want:
int j;
for ( j=0; j < i; j++ ) {
printf("%s\n", text[j]);
}
Also, in addition to the answers already posted, you're incrementing i in your for loop, not j.
Look at this loop:
while(fgets(buffer, BUFFERSIZE, stdin) != NULL){
text[i] = buffer;
i++;
}
You're setting every element of your array to the exact same value: the value of buffer, which is a pointer to a piece of memory. On every iteration you're reading data over the top of where you read it last time.
To fix this, you'd need to allocate a new buffer on each iteration, and set the array value to a pointer to that new buffer. That way each element of the array will point to a different section of memory.

Resources