C - insert lines from very big file into array - c

First of all, I'm quite new to C, and I know this question has been asked many times, but I could not find anything that helps with my problem.
Here is my code:
It takes a text file and stores each line in an array.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
FILE *file;
file = fopen("test.txt", "r");
char buffer[600];
char *lines[10000];
int i = 0;
while(fgets(buffer, sizeof(buffer), file))
{
lines[i] = malloc(sizeof(buffer));
strcpy(lines[i], buffer);
i++;
free(lines[i]);
}
fclose(file);
return 1;
}
This works fine for small text files.
However, it doesn't work with large ones, even after making buffer and lines much bigger. In fact, if I increase buffer[] and *lines[] to something like 1000000, the program produces no output (if I understood correctly, this is undefined behaviour). And I need to get this to work with a 100,000-line file with variable-length lines.
So, how could I declare a very large array so I can store each line? As I explained, it doesn't work with a large file.
Any help is appreciated!

char *lines[10000]; is just an array of pointers to the lines, not the memory that is going to store the actual lines.
malloc allocates a chunk of memory for each line; you are supposed to call free only when you are done using that chunk.
If you remove the free, your solution will work, but you need to remember to free the lines at some later point.

And I need to get this to work with a 100,000-line file with variable-length lines.
So, how could I declare a very large array so I can store each line?
This line
char *lines[10000];
gives you a variable with automatic storage duration, often called a local variable.
On most systems such variables are located on a stack, and most systems have a fixed limit on the size of the stack and therefore also on the size of such a local variable.
So if you change the code to
char *lines[1000000];
to be able to handle larger files, it is likely that the variable uses too much memory on the stack, i.e. you get a stack overflow.
A simple solution is to allocate the variable dynamically. Like:
char **lines = malloc(1000000 * sizeof *lines);
This will allocate 1000000 char-pointers and you can use lines as if it's an array - for instance like:
lines[i] = malloc(sizeof(buffer));
For something like this I'd also recommend that you take a look at realloc, so that you can adjust the size of the memory as needed.
Besides that, your use of free seems strange, and it is certainly wrong because you increment i between the malloc and the free.

You can allocate exactly as much space as you need, which gets rid of the fixed, limited sizes.
I have "massaged" your example in this way. The only thing I didn't do is a first pass through the file to find the longest line, so I kept the fixed buffer length.
Allocate only as many pointers to lines as you need. For this you define a pointer to pointers to char.
Allocate only as many characters for each line as you need. This is done most conveniently with the function strdup(). If your library doesn't have it (it is specified by POSIX, and by ISO C only since C23), you can replace it with the right combination of strlen(), malloc(), and strcpy(). How to do this is left as an exercise for you. ;-)
Handle allocation errors, especially if you plan to read huge files.
Free the allocated memory blocks. The order in which you free the lines is not important, but lines itself has to be kept until all lines[*] are freed.
This is the code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
    FILE *file;
    file = fopen("test.txt", "r");
    if (file == NULL)
    {
        perror("test.txt");
        return EXIT_FAILURE;
    }
    char buffer[600];
    char **lines = NULL;
    int i = 0;
    while (fgets(buffer, sizeof(buffer), file))
    {
        // grow the pointer array by one slot for the new line
        lines = realloc(lines, (i + 1) * sizeof (char*));
        if (lines == NULL)
        {
            // any error handling you like...
            return EXIT_FAILURE;
        }
        // store a copy that is exactly as long as the line
        lines[i] = strdup(buffer);
        if (lines[i] == NULL)
        {
            // any error handling you like...
            return EXIT_FAILURE;
        }
        i++;
    }
    fclose(file);
    // work with the lines
    for (int j = 0; j < i; ++j)
    {
        free(lines[j]);
    }
    free(lines);
    return EXIT_SUCCESS;
}
Some notes:
Because realloc() is called for every line, the run time of your program will scale badly for files with a very large number of lines. To improve this you could use a better strategy, for example growing the array in increasingly large steps. But this is a completely different issue.
You don't strictly need to free allocated memory yourself if you need it until the end of the program. On mainstream operating systems, all of the process's memory is reclaimed when it exits.
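To make the two open points above concrete (the strdup() replacement and growing the array in steps), here is a minimal sketch. The helper names dup_line, read_all_lines and free_lines are made up for this illustration, and the starting capacity of 16 is an arbitrary assumption. It keeps the fixed 600-byte buffer, so longer lines are still split across several entries.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: a portable stand-in for strdup(). */
static char *dup_line(const char *s)
{
    size_t size = strlen(s) + 1;          /* include the terminating '\0' */
    char *copy = malloc(size);
    if (copy != NULL)
        memcpy(copy, s, size);
    return copy;
}

/* Hypothetical helper: frees an array built by read_all_lines(). */
static void free_lines(char **lines, size_t count)
{
    for (size_t i = 0; i < count; i++)
        free(lines[i]);
    free(lines);
}

/* Hypothetical helper: reads every line of fp into a malloc'ed array.
 * Stores the number of lines in *count. Returns NULL (with *count == 0)
 * on allocation failure or if the file contains no lines. */
static char **read_all_lines(FILE *fp, size_t *count)
{
    char buffer[600];
    char **lines = NULL;
    size_t used = 0, cap = 0;

    *count = 0;
    while (fgets(buffer, sizeof buffer, fp)) {
        if (used == cap) {
            /* Double the capacity instead of growing by one slot. */
            size_t new_cap = cap ? cap * 2 : 16;
            char **tmp = realloc(lines, new_cap * sizeof *tmp);
            if (tmp == NULL) {
                free_lines(lines, used);
                return NULL;
            }
            lines = tmp;
            cap = new_cap;
        }
        lines[used] = dup_line(buffer);
        if (lines[used] == NULL) {
            free_lines(lines, used);
            return NULL;
        }
        used++;
    }
    *count = used;
    return lines;
}
```

With doubling, a 100,000-line file needs only about 13 calls to realloc() for the pointer array instead of 100,000. The caller opens the file, calls read_all_lines(file, &n), works with lines[0] to lines[n-1], and finally calls free_lines(lines, n).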

Related

Saving getline() output to an external array

The external array srclns should keep each line read from a text file. But when I read its contents afterwards, the lines appear to be empty strings. What am I missing in the code below?
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#define MAXSRC 20
char *srclns[MAXSRC]; /* source lines */
size_t read_lines(char *path)
{
FILE *stream;
ssize_t read;
char *lnptr;
size_t n;
size_t count;
stream = fopen(path, "r");
lnptr = NULL;
n = 0;
count = 0;
if (!stream) {
fprintf(stderr, "Can't open source '%s'\n", path);
exit(EXIT_FAILURE);
}
while ((read = getline(&lnptr, &n, stream)) != -1) {
srclns[count++] = lnptr;
}
free(lnptr);
fclose(stream);
return count;
}
int main()
{
size_t n = read_lines("foo.txt");
for (size_t i = 0; i<n; i++)
printf("%zu %s\n", i, srclns[i]);
exit(EXIT_SUCCESS);
}
This prints only the line numbers with seemingly empty strings afterwards:
0
1
2
3
4
5
So from what I can see not only does your program not work but it might have memory leaks. This is due to the behavior of getline which uses dynamic allocation.
Let's take a closer look at what your program does, in particular the while ((read = getline(&lnptr, &n, stream)) != -1) loop:
getline works with &lnptr, which is of type char**.
If the pointer is NULL, it allocates enough memory on the heap (dynamically) to store the line being read.
If the pointer is not NULL, it is expected to point to a buffer of size n.
If the buffer is big enough (large enough for the line plus its terminating null character), it is used to store the string.
If the buffer is too small, getline reallocates it so that a big enough buffer is available. Upon reallocation, n is updated to the new buffer size. In certain cases reallocation means that lnptr has to be modified, and it will be. (This can happen if there is not enough contiguous free memory right after the current buffer; in that case the memory is allocated somewhere else on the heap. If this is of interest to you, I suggest you research it, because dynamic memory allocation is a rather complex topic; otherwise just know that the pointer might change, and that's enough for now.)
Now here are the issues with your program (at least this is what I can infer from the information I have. I might be wrong but this seems the most plausible interpretation):
On the first iteration of the loop lnptr is NULL. Thus getline allocates memory on the heap, stores the line, and updates lnptr to point to the newly allocated buffer.
Within the loop you store the pointer to the allocated buffer in srclns[0].
On subsequent iterations the buffer is overwritten and maybe resized by getline, and you store the pointer to that same buffer in srclns[count].
After the loop you free the buffer, which invalidates the memory that every pointer in srclns points to.
When you print, you most likely read an invalid memory zone (the one pointed to by the pointer you just freed), and by luck it seems to start with a null terminator. (The last line of your file was probably an empty line, and nothing actively changed this memory zone after the free...)
How to fix it:
You could explicitly handle dynamic allocation with malloc and/or calloc, but that seems a bit complicated and, as shown before, getline can handle it for you. My suggestion is as follows:
Set all your elements in srclns to NULL
for(int i = 0; i < MAXSRC; ++i)
{
srclns[i] = NULL;
}
Then rework the while loop to pass a new element of srclns in each iteration. Each call to getline will see a NULL pointer, and will therefore allocate memory and update the cell of srclns to point to it. As a bonus, with this implementation you're certain never to go out of bounds of srclns:
for(int i = 0; i < MAXSRC; ++i)
{
    n = 0;
    if(getline(&srclns[i], &n, stream) == -1)
    {
        break; // We get out if we reached EOF
    }
}
Free all of this allocated memory in main after you have used it in your printf:
for(int i = 0; i < MAXSRC; ++i)
{
if(srclns[i] != NULL)
{
free(srclns[i]);
}
}
Adjust as needed. I did not test this code, so I might have made some mistakes; feel free to correct it. You might also want to adapt the code to your needs.
The function getline allocates a new buffer only if lnptr is NULL; otherwise it reuses the existing buffer, growing it if necessary. lnptr is NULL for the first iteration, but it needs to be reset to NULL after each line has been stored:
while ((read = getline(&lnptr, &n, stream)) != -1) {
    srclns[count++] = lnptr;
    lnptr = NULL;
}
Otherwise, lnptr will still point to the memory allocated in the first iteration for all subsequent iterations, and getline will repeatedly write to that location.
Even though it is not the cause of the problem, the allocated memory should be freed, for example by adding these lines before exit(EXIT_SUCCESS):
for (size_t i = 0; i<n; i++)
free(srclns[i]);
Whether or not using getline is a good practice is another discussion which you may want to look into. It is not the most portable solution.
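Putting both answers together, the reading loop could be sketched as below. This assumes a POSIX system that provides getline(); the function name read_lines_into and its parameters are made up for the illustration. Each call gets a fresh NULL pointer, so every line ends up in its own buffer, and the loop stops at max so the array cannot overflow.

```c
#define _POSIX_C_SOURCE 200809L   /* for getline() and ssize_t */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: stores up to max lines from stream in out[].
 * Each stored pointer is owned by the caller and must be free()d.
 * Returns the number of lines stored. */
static size_t read_lines_into(FILE *stream, char **out, size_t max)
{
    size_t count = 0;

    while (count < max) {
        char *line = NULL;    /* NULL asks getline for a new buffer */
        size_t cap = 0;

        if (getline(&line, &cap, stream) == -1) {
            free(line);       /* getline may allocate even when it fails */
            break;            /* end of file or read error */
        }
        out[count++] = line;  /* ownership moves to the array */
    }
    return count;
}
```

In the question's program this would be called as read_lines_into(stream, srclns, MAXSRC), with main freeing srclns[0] to srclns[n-1] after printing.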

How to fix "realloc(): invalid pointer"

I am trying to write a function to convert a text file into a CSV file.
The input file has 3 lines with space-delimited entries. I have to find a way to read a line into a string and transform the three lines from the input file to three columns in a CSV file.
The files look like this :
Jake Ali Maria
24 23 43
Montreal Johannesburg Sydney
And I have to transform it into something like this:
Jake, 24, Montreal
...etc
I figured I could create a char **line variable that would hold three references to three separate char arrays, one for each of the three lines of the input file. I.e., my goal is to have *(line+i) store the i+1'th line of the file.
I wanted to avoid hardcoding char array sizes, such as
char line1 [999];
fgets(line1, 999, file);
so I wrote a while loop to fgets pieces of a line into a small buffer array of predetermined size, and then strcat and realloc memory as necessary to store the line as a string, with *(line+i) as a pointer to the string, where i is 0 for the first line, 1 for the second, etc.
Here is the problematic code:
#include <stdio.h>
#include<stdlib.h>
#include<string.h>
#define CHUNK 10
char** getLines (const char * filename){
FILE *file = fopen(filename, "rt");
char **lines = (char ** ) calloc(3, sizeof(char*));
char buffer[CHUNK];
for(int i = 0; i < 3; i++){
int lineLength = 0;
int bufferLength = 0;
*(lines+i) = NULL;
do{
fgets(buffer, CHUNK, file);
buffLength = strlen(buffer);
lineLength += buffLength;
*(lines+i) = (char*) realloc(*(lines+i), (lineLength +1)*sizeof(char));
strcat(*(lines+i), buffer);
}while(bufferLength ==CHUNK-1);
}
puts(*(lines+0));
puts(*(lines+1));
puts(*(lines+2));
fclose(file);
}
void load_and_convert(const char* filename){
char ** lines = getLines(filename);
}
int main(){
const char* filename = "demo.txt";
load_and_convert(filename);
}
This works as expected only for i=0. However, going through this with GDB, I see that I get a realloc(): invalid pointer error. The buffer loads fine, and it only crashes when I call 'realloc' in the for loop for i=1, when I get to the second line.
I managed to store the strings like I wanted in a small example I did to try to see what was going on, but the inputs were all on the same line. Maybe this has to do with fgets reading from a new line?
I would really appreciate some help with this, I've been stuck all day.
Thanks a lot!
***edit
I tried as suggested to use calloc instead of malloc to initialize the variable **lines, but I still have the same issue. I have added the modifications to the original code I uploaded.
***edit
After deleting the file and recompiling, the above now seems to work. Thank you to everyone for helping me out!
You allocate line (which is a misnomer since it's not a single line), which is a pointer to three char*s. You never initialize the contents of line (that is, you never make any of those three char*s point anywhere). Consequently, when you do realloc(*(line + i), ...), the first argument is uninitialized garbage.
To use realloc to do an initial memory allocation, its first argument must be a null pointer. You should explicitly initialize each element of line to NULL first.
Additionally, *(line+i) = (char *)realloc(*(line+i), ...) is still bad because if realloc fails to allocate memory, it will return a null pointer, clobber *(line + i), and leak the old pointer. You instead should split it into separate steps:
char* p = realloc(line[i], ...);
if (p == NULL) {
    // Handle failure somehow.
    exit(1);
}
line[i] = p;
A few more notes:
In C, you should avoid casting the result of malloc/realloc/calloc. It's not necessary since C allows implicit conversion from void* to other pointer types, and the explicit cast could mask an error where you accidentally omit #include <stdlib.h>.
sizeof(char) is 1 by definition, so multiplying by it is redundant.
When you're allocating memory, it's safer to get into a habit of using T* p = malloc(n * sizeof *p); instead of T* p = malloc(n * sizeof (T));. That way if the type of p ever changes, you won't silently be allocating the wrong amount of memory if you neglect to update the malloc (or realloc or calloc) call.
Here, you have to zero your array of pointers (for example by using calloc()),
char **line = (char**)malloc(sizeof(char*)*3); //allocate space for three char* pointers
otherwise the reallocs
*(line+i) = (char *)realloc(*(line+i), (inputLength+1)*sizeof(char)); //+1 for the empty character
use an uninitialized pointer, leading to undefined behaviour.
That it works with i=0 is pure coincidence, and a typical pitfall when dealing with UB.
Furthermore, when using strcat(), you have to make sure that the first parameter is already a zero-terminated string! This is not the case here, since at the first iteration, realloc(NULL, ...); leaves you with an uninitialized buffer. This can lead to strcat() writing past the end of your allocated buffer and corrupting the heap. A possible fix is to use strcpy() instead of strcat(), copying to the current end of the line (this should even be more efficient here):
do{
    if (fgets(buffer, CHUNK, file) == NULL)
        break; /* end of file or read error: nothing new in buffer */
    bufferLength = strlen(buffer);
    lines[i] = realloc(lines[i], (lineLength + bufferLength + 1));
    strcpy(lines[i]+lineLength, buffer);
    lineLength += bufferLength;
}while(bufferLength ==CHUNK-1);
The check bufferLength == CHUNK-1 will not do what you want if the line (including the newline) is exactly CHUNK-1 bytes long. A better check might be while (buffer[bufferLength-1] != '\n').
Btw. line[i] is far more readable than *(line+i) (which is semantically identical).
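Combining the points from both answers (start from NULL, check realloc() through a temporary, copy to the current end of the line, and stop at the newline), a line reader could be sketched like this. The function name read_whole_line is made up for the illustration; CHUNK is kept at 10 as in the question.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK 10

/* Hypothetical helper: returns one complete line (newline included, if
 * present) in a malloc'ed string. Returns NULL at end of file, on a read
 * error before any data, or on allocation failure. */
static char *read_whole_line(FILE *file)
{
    char buffer[CHUNK];
    char *line = NULL;
    size_t length = 0;

    while (fgets(buffer, sizeof buffer, file) != NULL) {
        size_t got = strlen(buffer);
        char *tmp = realloc(line, length + got + 1);
        if (tmp == NULL) {
            free(line);               /* the old block is still ours */
            return NULL;
        }
        line = tmp;
        memcpy(line + length, buffer, got + 1);   /* copies the '\0' too */
        length += got;
        if (got > 0 && line[length - 1] == '\n')
            break;                    /* the whole line has been read */
    }
    return line;                      /* NULL if nothing was read */
}
```

getLines() would then reduce to calling read_whole_line(file) three times and storing the results in lines[0] to lines[2], each of which the caller eventually frees.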

Segmentation Fault on fputs

I am pretty new to C and memory allocation in general. Basically what I am trying to do is copy the contents of an input file of unknown size and reverse its contents using recursion. I feel that I am very close, but I keep getting a segmentation fault when I try to write out what I presume to be the reversed contents of the file (I presume because I think I am doing it right....)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int recursive_back(char **lines, int lineNumber, FILE *input) {
char *input_line = malloc(sizeof(char) * 1000);
lines = realloc(lines, (lineNumber) * 1000 * sizeof(char));
if(fgets(input_line, 201, input) == NULL) {
*(lines + lineNumber) = input_line;
return 1;
}
else {
printf("%d\n", lineNumber);
return (1+recursive_back(lines, ++lineNumber, input));
}
}
void backward (FILE *input, FILE *output, int debugflag ) {
int i;
char **lines; //store lines in here
lines = malloc(1000 * sizeof(char *) ); //1000 lines
if(lines == NULL) { //if malloc failed
fprintf(stderr, "malloc of lines failed\n");
exit(1);
}
int finalLineCount, lineCount;
finalLineCount = recursive_back(lines, 0, input);
printf("test %d\n", finalLineCount);
for(i = finalLineCount; i > 0; i--) {
fputs(*(lines+i), output); //segfault here
}
}
I am using a simple input file to test the code. My input file is 6 lines long that says "This is a test input file". The actual input files are being opened in another function and passed over to the backward function. I have verified that the other functions in my program work since I have been playing around with different options. These two functions are the only functions that I am having trouble with. What am I doing wrong?
Your problem is here:
lines = realloc(lines, (lineNumber) * 1000 * sizeof(char));
exactly as @ooga said. There are at least three separate things wrong with it:
You are reallocating the memory block pointed to by recursive_back()'s local variable lines, and storing the new address (supposing that the reallocation succeeds) back into that local variable. The new location is not necessarily the same as the old, but the only pointer to it is a local variable that goes out of scope at the end of recursive_back(). The caller's corresponding variable is not changed (including when the caller is recursive_back() itself), and therefore can no longer be relied upon to be a valid pointer after recursive_back() returns.
You allocate space using the wrong type. lines has type char **, so the object it points to has type char *, but you are reserving space based on the size of char instead.
You are not reserving enough space, at least on the first call, when lineNumber is zero. On that call the space requested is exactly zero bytes, and the effect of the realloc() may be to free the memory pointed to by lines (what a zero-size realloc() does varies between implementations). On subsequent calls, the space allocated is always one line's worth less than you think you are allocating.
It looks like the realloc() is altogether unnecessary if you can rely on the input to have at most 1000 lines, so you should consider just removing it. If you genuinely do need to be able to reallocate in a way that the caller will see, then the caller needs to pass a pointer to its variable, so that recursive_back() can modify it via that pointer.
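The last option, passing a pointer to the caller's variable, could look like the following sketch. The name append_line and its signature are made up for the illustration; it also sizes the array by the element type (char *) rather than char, which addresses the second point above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: appends a copy of text to the array *lines.
 * The caller passes the address of its own pointer, so a block that
 * realloc() moves is still visible to the caller afterwards.
 * Returns 0 on success, -1 on allocation failure. */
static int append_line(char ***lines, size_t *count, const char *text)
{
    /* Room for one more element of type char *, not char. */
    char **tmp = realloc(*lines, (*count + 1) * sizeof **lines);
    if (tmp == NULL)
        return -1;                /* *lines is untouched and still valid */
    *lines = tmp;

    size_t size = strlen(text) + 1;
    tmp[*count] = malloc(size);
    if (tmp[*count] == NULL)
        return -1;
    memcpy(tmp[*count], text, size);
    (*count)++;
    return 0;
}
```

The caller starts with char **lines = NULL; size_t count = 0; and calls append_line(&lines, &count, input_line) for each line read. Writing the file backward is then a loop from count - 1 down to 0, which also avoids the off-by-one in the original loop that starts at finalLineCount.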

Reading a line from file in C, dynamically

#include <stdio.h>
#include <stdlib.h>
int main()
{
FILE *input_f;
input_f = fopen("Input.txt", "r"); //Opens the file in read mode.
if (input_f != NULL)
{
char line[2048];
while( fgets(line, sizeof line, input_f) != NULL )
{
//do something
}
fclose(input_f); //Close the input file.
}
else
{
perror("File couldn't opened"); //Will print that file couldn't opened and why.
}
return 0;
}
Hi. I know I can read line by line with this code in C, but I don't want to limit line size, say like in this code with 2048.
I thought about using malloc, but I don't know the size of the line before I read it, so IMO it cannot be done.
Is there a way not to limit the line size?
This question is just for my curiosity, thank you.
When you are allocating memory dynamically, you will want to change:
char line[2048];
to
#define MAXL 2048      /* the use of a define will become apparent when you */
size_t maxl = MAXL;    /* need to check whether a realloc is needed */
char *line = malloc (maxl * sizeof *line);
if (!line) {           /* always check to ensure allocation succeeded */
    fprintf (stderr, "error: memory allocation failed\n");
    exit (EXIT_FAILURE);
}
You read up to (maxl - 1) chars or a newline (if using fgetc, etc.), or read the line and then check whether line[strlen (line) - 1] == '\n' to determine whether you read the entire line (if using fgets). (POSIX requires that all lines end with a newline.) If you read maxl - 1 characters (fgetc) or did not read the newline (fgets), then it is a short read and more characters remain. Your choice is to realloc (generally doubling the size) and continue reading. To realloc:
char *tmp = realloc (line, 2 * maxl);
if (tmp) {
    line = tmp;
    maxl *= 2;
}
Note: never reallocate using your original pointer (e.g. line = realloc (line, 2 * maxl)), because if realloc fails it returns NULL while the original block stays allocated. Overwriting line with NULL then loses your only pointer to that block, so you leak the memory and lose access to any data that existed in line. Also note that maxl is typically doubled each time you realloc. However, you are free to choose whatever size-increasing scheme you like. (If you are concerned about zeroing all new memory allocated, you can use memset to initialize the newly allocated space to zero. This is useful in some situations where you want to ensure your line is always null-terminated.)
That is the basic dynamic allocation/reallocation scheme. Note you are reading until you read the complete line, so you will need to restructure your loop test. And lastly, since you allocated the memory, you are responsible for freeing the memory when you are done with it. A tool you cannot live without is valgrind (or similar memory checker) to confirm you are not leaking memory.
Tip: if you are reading and want to ensure your string is always null-terminated, then after allocating your block of memory, zero all characters. As mentioned earlier, memset is available, but if you choose calloc instead of malloc it will zero the memory for you. However, on realloc the new space is NOT zeroed either way, so calling memset on the newly added region is required regardless of which function originally allocated the block.
Tip 2: look at POSIX getline. getline will handle the allocation/reallocation needed so long as line is initialized to NULL. getline also returns the number of characters actually read, dispensing with the need to call strlen after fgets to determine the same.
Let me know if you have additional questions.
Consider 2 thoughts:
An upper bound of allocated memory is reasonable. The nature of the task should have some idea of a maximum line length, be it 80, 1024 or 1 Mbyte.
With a clever OS, actual usage of allocated memory may not occur until needed. See Why is malloc not "using up" the memory on my computer?
So let code allocate 1 big buffer to limit pathological cases and let the underlying memory management (re-)allocate real memory as needed.
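#define N (1000000)
char *buf = malloc(N);
...
while (fgets(buf, N, stdin) != NULL) {
    size_t len = strlen(buf);
    /* a full buffer without a trailing newline means the line was cut off */
    if (len == N-1 && buf[len-1] != '\n') {
        fprintf(stderr, "Excessively long line\n");
        exit(EXIT_FAILURE);
    }
}
free(buf);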

C: how to read in a variable amount of info from files and store it in array

I am not used to programming in C, so I am wondering how to have an array, read a variable number of values from a file, and store those values in the array.
//how do I declare an array whose sizes varies
do {
char buffer[1000];
fscanf(file, "%[^\n]\n", buffer);
//how do i add buffer to array
}while(!feof(file));
int nlines = 0;           /* Number of slots allocated in the result array */
char **lines = NULL;      /* Array of resulting lines */
int curline = 0;
char buffer[BUFSIZ];      /* Just allocate this once, not each time through the loop */
while (fgets(buffer, sizeof buffer, file)) {  /* fgets() is the easy way to read a line */
    if (curline >= nlines) {                  /* Have we filled up the result array? */
        nlines += 1000;                       /* Increase size by 1,000 */
        lines = realloc(lines, nlines * sizeof(*lines)); /* And grow the array */
    }
    lines[curline] = strdup(buffer); /* Make a copy of the input line and add it to the array */
    curline++;
}
Arrays are always fixed-size in C. You cannot change their size. What you can do is make an estimate of how much space you'll need beforehand and allocate that space dynamically (with malloc()). If you happen to run out of space, you reallocate. See the documentation for realloc() for that. Basically, you do:
buffer = realloc(buffer, size);
The new size can be larger or smaller than what you had before (meaning you can "grow" or "shrink" the array.) So if at first you want, say, space for 5000 characters, you do:
char* buffer = malloc(5000);
If later you run out of space and want an additional 2000 characters (so the new size will be 7000), you would do:
buffer = realloc(buffer, 7000);
The already existing contents of buffer are preserved. Note that realloc() might not be able to really grow the memory block, so it might allocate an entirely new block first, then copy the contents of the old memory to the new block, and then free the old memory. That means that if you stored a copy of the buffer pointer elsewhere, it will point to the old memory block which doesn't exist anymore. For example:
char* ptr = buffer;
buffer = realloc(buffer, 7000);
At that point, ptr is only valid if ptr == buffer, which is not guaranteed to be the case.
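One way to stay safe across a realloc(), sketched here under the assumption that you control all references into the block, is to remember offsets instead of raw pointers and to go through a small wrapper that never loses the old block. The name grow_buffer is made up for the illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: resizes *buffer to new_size bytes.
 * Returns 0 on success. On failure it returns -1 and leaves *buffer
 * pointing at the old, still valid block. */
static int grow_buffer(char **buffer, size_t new_size)
{
    char *tmp = realloc(*buffer, new_size);
    if (tmp == NULL)
        return -1;
    *buffer = tmp;      /* the block may have moved; update the owner */
    return 0;
}
```

Instead of keeping char *ptr = buffer + 6 around, keep size_t offset = 6 and compute buffer + offset again after grow_buffer(&buffer, 7000) has returned. The offset stays meaningful even when the block moves, because the existing contents are preserved.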
It appears that you are trying to read until you read a newline.
The easiest way to do this is via getline.
char *buffer = NULL;
size_t buffer_len = 0;
ssize_t ret = getline(&buffer, &buffer_len, file);
...this will read one line of text from the file file (unless ret is -1, in which case there's an error or you're at the end of the file). getline allocates the buffer for you, so call free(buffer) when you are done with it.
An array where the string data is in the array entry is usually a non-optimal choice. If the complete set of data will fit comfortably in memory and there's a reasonable upper bound on the number of entries, then a pointer-array is one choice.
But first, avoid scanf %s and %[] formats without explicit lengths. Using your example buffer size of 1000, the maximum string length that you can read is 999, so:
/* Some needed data */
int n;
struct ptrarray_t
{
    char **strings;
    int nalloc; /* number of string pointers allocated */
    int nused;  /* number of string pointers used */
} pa_hdr; /* presume this was initialized previously */
...
n = fscanf(file, "%999[^\n]", buffer);
if (n != 1 || getc(file) != '\n')
{
    /* there's a problem: empty line, overlong line, EOF, or read error */
}
/* Now add a string to the array */
if (pa_hdr.nused < pa_hdr.nalloc)
{
    size_t len = strlen(buffer);
    char *cp = malloc(len + 1);
    if (cp != NULL)
    {
        strcpy(cp, buffer);
        pa_hdr.strings[pa_hdr.nused++] = cp;
    }
}
A reference to any string hereafter is just pa_hdr.strings[i], and a decent design will use function calls or macros to manage the header, which in turn will be in a header file and not inline. When you're done with the array, you'll need a delete function that will free all of those malloc()ed pointers.
If there are a large number of small strings, malloc() can be costly, both in time and space overhead. You might manage pools of strings in larger blocks that will live nicely with the memory allocation and paging of the host OS. Using a set of functions to effectively make an object out of this string-array will help your development. You can pick a simple strategy, as above, and optimize the implementation later.
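As a rough illustration of the pooling idea, and not a tuned implementation, many small strings can be stored back to back in one block that grows by doubling. The type string_pool, the functions pool_add and pool_get, and the initial size of 4096 bytes are all assumptions made for this sketch. Strings are identified by their offset into the block, because the block may move when it grows.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical string pool: one large block holding many strings. */
struct string_pool {
    char  *data;   /* the block itself, NULL while empty */
    size_t used;   /* bytes occupied so far */
    size_t size;   /* bytes allocated */
};

/* Copies s into the pool and returns its offset, or (size_t)-1 on
 * allocation failure. */
static size_t pool_add(struct string_pool *pool, const char *s)
{
    size_t need = strlen(s) + 1;          /* include the '\0' */

    if (pool->used + need > pool->size) {
        size_t new_size = pool->size ? pool->size : 4096;
        while (pool->used + need > new_size)
            new_size *= 2;
        char *tmp = realloc(pool->data, new_size);
        if (tmp == NULL)
            return (size_t)-1;            /* the pool itself stays valid */
        pool->data = tmp;
        pool->size = new_size;
    }
    size_t offset = pool->used;
    memcpy(pool->data + offset, s, need);
    pool->used += need;
    return offset;
}

/* Returns the string stored at offset. The pointer is only valid until
 * the next pool_add(), which may move the block. */
static const char *pool_get(const struct string_pool *pool, size_t offset)
{
    return pool->data + offset;
}
```

A pool starts as struct string_pool pool = { NULL, 0, 0 }; and is released with a single free(pool.data), so there is one allocation to track instead of one per string.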

Resources