Reading a line from file in C, dynamically

#include <stdio.h>
#include <stdlib.h>

int main()
{
    FILE *input_f;
    input_f = fopen("Input.txt", "r"); // Opens the file in read mode.
    if (input_f != NULL)
    {
        char line[2048];
        while (fgets(line, sizeof line, input_f) != NULL)
        {
            // do something
        }
        fclose(input_f); // Close the input file.
    }
    else
    {
        perror("File couldn't be opened"); // Will print that the file couldn't be opened and why.
    }
    return 0;
}
Hi. I know I can read line by line with this code in C, but I don't want to limit the line size, as this code does with 2048.
I thought about using malloc, but I don't know the size of the line before I read it, so IMO it cannot be done.
Is there a way not to limit the line size?
This question is just for my curiosity, thank you.

When you are allocating memory dynamically, you will want to change:
char line[2048];
to
#define MAXL 2048       /* the use of a define will become apparent when you  */
size_t maxl = MAXL;     /* need to check to determine if a realloc is needed  */
char *line = malloc (maxl * sizeof *line);
if (!line) {            /* always check to ensure the allocation succeeded */
    /* ...handle the error: memory allocation failed... */
}
You read up to (maxl - 1) chars or a newline (if using fgetc, etc.), or read the line and then check whether line[strlen (line) - 1] == '\n' to determine whether you read the entire line (if using fgets). (POSIX requires that all lines terminate with a newline.) If you read maxl characters (fgetc) or did not read the newline (fgets), then it was a short read and more characters remain. Your choice then is to realloc (generally doubling the size) and try again. To realloc:
char *tmp = realloc (line, 2 * maxl);
if (tmp) {
    line = tmp;
    maxl *= 2;
}
Note: never reallocate using your original pointer (e.g. line = realloc (line, 2 * maxl)), because if realloc fails it returns NULL and leaves the original block allocated; assigning that NULL to line loses your only pointer to the data that existed in line (a memory leak, and the data is no longer reachable). Also note that maxl is typically doubled each time you realloc. However, you are free to choose whatever size-increasing scheme you like. (If you are concerned about zeroing all newly allocated memory, you can use memset to initialize the newly allocated space to zero. This is useful in situations where you want to ensure your line is always null-terminated.)
That is the basic dynamic allocation/reallocation scheme. Note you are reading until you read the complete line, so you will need to restructure your loop test. And lastly, since you allocated the memory, you are responsible for freeing the memory when you are done with it. A tool you cannot live without is valgrind (or similar memory checker) to confirm you are not leaking memory.
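For completeness, here is a minimal, untested sketch of how the pieces above can fit together with fgets: it appends into line and doubles the allocation whenever a read fills the buffer without finding a newline. The helper name read_long_line and the choice to keep the trailing '\n' are illustrative assumptions, not part of the original code.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXL 2048

/* Read one line of arbitrary length from fp.  Returns a malloc'd string
 * (including the trailing '\n' if one was read) or NULL on EOF/error;
 * the caller must free it. */
char *read_long_line (FILE *fp)
{
    size_t maxl = MAXL, used = 0;
    char *line = malloc (maxl * sizeof *line);

    if (!line)                          /* always check the allocation */
        return NULL;
    line[0] = '\0';

    while (fgets (line + used, (int)(maxl - used), fp)) {
        used += strlen (line + used);
        if (used && line[used - 1] == '\n')
            return line;                /* complete line read */

        char *tmp = realloc (line, 2 * maxl);   /* short read: grow and retry */
        if (!tmp) {
            free (line);
            return NULL;
        }
        line = tmp;
        maxl *= 2;
    }
    if (used)                           /* last line of the file had no newline */
        return line;
    free (line);                        /* nothing read: EOF or error */
    return NULL;
}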
Tip: if you are reading and want to ensure your string is always null-terminated, then after allocating your block of memory, zero all of its characters. As mentioned earlier, memset is available, but if you choose calloc instead of malloc it will zero the memory for you. However, on realloc the new space is NOT zeroed either way, so calling memset is required regardless of which function originally allocated the block.
Tip 2: Look at the POSIX getline. getline will handle the allocation/reallocation needed as long as line is initialized to NULL (and the size argument to 0). getline also returns the number of characters actually read, dispensing with the need to call strlen after fgets to determine the same.
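A minimal sketch of that getline approach (POSIX only, untested); the file name Input.txt just follows the question's example:

#define _POSIX_C_SOURCE 200809L   /* expose getline on POSIX systems */
#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    FILE *fp = fopen ("Input.txt", "r");   /* file name as in the question */
    char *line = NULL;     /* getline allocates and grows this for us      */
    size_t n = 0;          /* current allocation size, updated by getline  */
    ssize_t nread;         /* characters read, including the '\n'          */

    if (!fp) {
        perror ("fopen");
        return 1;
    }
    while ((nread = getline (&line, &n, fp)) != -1) {
        printf ("read %zd chars: %s", nread, line);
    }
    free (line);           /* one buffer is reused for every line */
    fclose (fp);
    return 0;
}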
Let me know if you have additional questions.

Consider 2 thoughts:
An upper bound on the allocated memory is reasonable. The nature of the task should suggest some maximum line length, be it 80, 1024 or 1 MByte.
With a clever OS, actual usage of allocated memory may not occur until needed. See Why is malloc not "using up" the memory on my computer?
So let code allocate 1 big buffer to limit pathological cases and let the underlying memory management (re-)allocate real memory as needed.
#define N (1000000)

char *buf = malloc(N);
...
while (fgets(buf, N, stdin) != NULL) {
    size_t len = strlen(buf);
    if (len == N-1 && buf[len-1] != '\n') {   /* buffer filled without a newline */
        fprintf(stderr, "Excessively long line\n");
        exit(EXIT_FAILURE);
    }
}
free(buf);

Related

Saving getline() output to an external array

The external array srclns should keep each line read from a text file. But reading its content afterwards, it seems like the read lines are empty strings. What am I missing in the code below?
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

#define MAXSRC 20

char *srclns[MAXSRC]; /* source lines */

size_t read_lines(char *path)
{
    FILE *stream;
    ssize_t read;
    char *lnptr;
    size_t n;
    size_t count;

    stream = fopen(path, "r");
    lnptr = NULL;
    n = 0;
    count = 0;
    if (!stream) {
        fprintf(stderr, "Can't open source '%s'\n", path);
        exit(EXIT_FAILURE);
    }
    while ((read = getline(&lnptr, &n, stream)) != -1) {
        srclns[count++] = lnptr;
    }
    free(lnptr);
    fclose(stream);
    return count;
}

int main()
{
    size_t n = read_lines("foo.txt");
    for (size_t i = 0; i < n; i++)
        printf("%zu %s\n", i, srclns[i]);
    exit(EXIT_SUCCESS);
}
This prints only the line numbers with seemingly empty strings afterwards:
0
1
2
3
4
5
So from what I can see, not only does your program not work, but it might also leak memory. This is due to the behavior of getline, which uses dynamic allocation.
Let's take a closer look at what your program does, in particular the while ((read = getline(&lnptr, &n, stream)) != -1) loop:
getline works with &lnptr, which is of type char**.
If the pointer is NULL, it will allocate enough memory on the heap (dynamically) to store the line being read.
If the pointer is not NULL, then it is expected to point to a buffer of size n.
If the buffer is big enough (greater than or equal to the line length), it is used to store the string.
If the buffer is too small, then memory is reallocated by getline so that a big enough buffer is available. Upon reallocation, n is updated to the new buffer size. And in certain cases reallocation will imply that lnptr has to be modified, and it will be. (This might happen if there is not enough consecutive free memory right after the current buffer; in that case the memory is allocated somewhere else on the heap. If this is of interest to you, I suggest you research it, because dynamic memory allocation is a rather complex topic. Otherwise, just know the pointer might change; that's enough for now.)
Now here are the issues with your program (at least this is what I can infer from the information I have. I might be wrong but this seems the most plausible interpretation):
On the first iteration of the loop lnptr is NULL. Thus getline allocates memory on the heap, stores the line, and updates lnptr to point to the newly allocated buffer.
Within the loop you store the pointer to the allocated buffer in srclns[0].
On the subsequent iterations the buffer is overwritten and maybe resized by getline, and you still store the pointer to that same buffer in srclns[count].
After the loop you free the buffer, discarding the memory that every pointer in srclns points to.
When you print, you most likely read an invalid memory zone (the zone pointed to by the pointer you just freed), and it happens to start with a terminating character. (The last line of your file was probably an empty line, and nothing actively changed this memory zone after the free...)
How to fix it:
You could explicitly handle dynamic allocation with malloc and/or calloc, but that seems a bit complicated and, as shown before, getline can handle it for you. My suggestion is as follows:
Set all your elements in srclns to NULL
for(int i = 0; i < MAXSRC; ++i)
{
    srclns[i] = NULL;
}
Then rework the while loop to pass a new element of srclns in each iteration. Each call to getline will see a NULL pointer, thus allocating memory and updating the cell of srclns to point to it. As a bonus, with this implementation you are certain of never going out of bounds of srclns:
for(int i = 0; i < MAXSRC; ++i)
{
    n = 0;
    if(getline(&srclns[i], &n, stream) == -1)
    {
        break; // We get out if we reached EOF
    }
}
Free all of this allocated memory in main after you have accessed it for your printf:
for(int i = 0; i < MAXSRC; ++i)
{
    if(srclns[i] != NULL)
    {
        free(srclns[i]);
    }
}
Adjust as needed. I did not test the code, so I might have made some mistakes... feel free to correct it. You might also want to adapt the code to match your needs.
The function getline will allocate a fresh buffer only if lnptr is NULL; otherwise it reuses (and, if necessary, resizes) the buffer it is given. lnptr is NULL for the first iteration, but it will need to be reset to NULL afterwards:
while ((read = getline(&lnptr, &n, stream)) != -1) {
    srclns[count++] = lnptr;
    lnptr = NULL;
}
Otherwise, lnptr will still point to the memory allocated in the first iteration for all subsequent iterations and getline will repeatedly try to write to that location.
Even though it is not the cause of the problem, the allocated memory should be freed. For example, by adding these lines before exit(EXIT_SUCCESS):
for (size_t i = 0; i < n; i++)
    free(srclns[i]);
Whether or not using getline is a good practice is another discussion which you may want to look into. It is not the most portable solution.
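Putting the two fixes above together (reset lnptr to NULL after storing it, and free each stored line once you are done with it), the reworked function might look like the following untested sketch; the MAXSRC bound on the loop is an addition here so the global array cannot overflow:

size_t read_lines(char *path)
{
    FILE *stream = fopen(path, "r");
    char *lnptr = NULL;   /* NULL so getline allocates a fresh buffer each time */
    size_t n = 0;
    size_t count = 0;

    if (!stream) {
        fprintf(stderr, "Can't open source '%s'\n", path);
        exit(EXIT_FAILURE);
    }
    while (count < MAXSRC && getline(&lnptr, &n, stream) != -1) {
        srclns[count++] = lnptr;   /* keep this buffer; do NOT free it here */
        lnptr = NULL;              /* force a new allocation on the next call */
        n = 0;
    }
    free(lnptr);   /* buffer possibly allocated by the failed final call */
    fclose(stream);
    return count;
}

In main, the stored lines are then freed after the printf loop, as shown above.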

C - insert lines from very big file into array

First of all, I'm quite new to C, and I know this is a very common question; however, I could not find anything that could help me with my problem.
Here is my code:
It takes a text file and stores each line in an array.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    FILE *file;
    file = fopen("test.txt", "r");
    char buffer[600];
    char *lines[10000];
    int i = 0;

    while(fgets(buffer, sizeof(buffer), file))
    {
        lines[i] = malloc(sizeof(buffer));
        strcpy(lines[i], buffer);
        i++;
        free(lines[i]);
    }
    fclose(file);
    return 1;
}
This works fine for small text files.
However, it doesn't work with large ones (even when setting buffer and lines to much bigger sizes). Actually, if I increase buffer[] and *lines[] to something like 1000000 bytes, it doesn't produce anything (if I understood correctly, that gives undefined behaviour). And I need to get this working with a 100,000-line file with variable-length lines.
So, how could I declare a very large array so I can store each line? Since, as I explained, it doesn't work with a large file.
Any help is appreciated!
char *lines[10000]; is just an array of pointers to the lines, not the array (memory) that is going to store the actual lines.
malloc is allocating a chunk of memory for each line; you are supposed to call free only when you are done using this chunk.
If you remove the free, your solution would work, but you need to remember to free at some other point.
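In other words, keep each pointer in the array and postpone the free to a separate loop once the lines are no longer needed. A minimal, untested sketch of that change (still capped at the question's 10000 lines, and allocating only as much as each line needs):

while (i < 10000 && fgets(buffer, sizeof(buffer), file))
{
    lines[i] = malloc(strlen(buffer) + 1);   /* just enough room for this line */
    if (lines[i] == NULL)
        break;                               /* allocation failed */
    strcpy(lines[i], buffer);
    i++;                                     /* note: no free here */
}

/* ... use lines[0] .. lines[i-1] ... */

for (int j = 0; j < i; j++)
    free(lines[j]);                          /* free only when done with them */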
And I need to get this work with a 100.000 lines file with variable length lines,
So, how could I declare a very large array so I can pass each line?
This line
char *lines[10000];
gives you a variable with automatic storage duration - often called a local variable.
On most systems such a variable is located on the stack, and most systems have a fixed limit for the size of the stack and thereby also a limit for the size of such a local variable.
So if you change the code to
char *lines[1000000];
to be able to handle larger files, it is likely that the variable uses too much memory on the stack, i.e. you get a stack overflow.
A simple solution is to allocate the variable dynamically. Like:
char **lines = malloc(1000000 * sizeof *lines);
This will allocate space for 1000000 char pointers, and you can use lines as if it were an array - for instance like:
lines[i] = malloc(sizeof(buffer));
For something like this I'd also recommend that you take a look at realloc, so that you can adjust the size of the memory as needed (a sketch follows at the end of this answer).
Besides that, your use of free seems strange, and it is certainly wrong, as you increment i between the malloc and the free.
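To illustrate that realloc suggestion, here is an untested sketch that lets the pointer array start small and double whenever it fills up; buffer and file are the question's variables, and the names capacity and count are just illustrative:

size_t capacity = 1000;                 /* initial guess */
size_t count = 0;
char **lines = malloc(capacity * sizeof *lines);

if (lines == NULL)
    return 1;                           /* allocation failed */

while (fgets(buffer, sizeof buffer, file))
{
    if (count == capacity)              /* pointer array is full: double it */
    {
        char **tmp = realloc(lines, 2 * capacity * sizeof *lines);
        if (tmp == NULL)
            break;                      /* keep what we already have */
        lines = tmp;
        capacity *= 2;
    }
    lines[count] = malloc(strlen(buffer) + 1);
    if (lines[count] == NULL)
        break;
    strcpy(lines[count], buffer);
    count++;
}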
You can allocate exactly as much space as you need, so you get rid of the fixed, limited sizes.
I have "massaged" your example in this way. The only thing I didn't do is a first pass through the file to obtain the longest line, so I kept the fixed buffer length.
Allocate only as many pointers to the lines as you need. For this you define a pointer to pointers to char.
Allocate only as many characters for each line as you need. This is done most conveniently with the function strdup(). If your library doesn't have it (it is not standard), you can replace it with the right combination of strlen(), malloc(), and strcpy(). How to do this is left as an exercise for you ;-) (a possible sketch appears after the notes at the end of this answer).
Handle allocation errors, especially if you plan to read huge files.
Free the allocated memory blocks; the order in which the lines are freed is not important, but lines itself has to be kept until all lines[*] are freed.
This is the code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    FILE *file;
    file = fopen("test.txt", "r");
    char buffer[600];
    char **lines = NULL;
    int i = 0;

    while (fgets(buffer, sizeof(buffer), file))
    {
        lines = realloc(lines, (i + 1) * sizeof (char*));
        if (lines == NULL)
        {
            // any error handling you like...
            return EXIT_FAILURE;
        }
        lines[i] = strdup(buffer);
        if (lines[i] == NULL)
        {
            // any error handling you like...
            return EXIT_FAILURE;
        }
        i++;
    }
    fclose(file);

    // work with the lines

    for (int j = 0; j < i; ++j)
    {
        free(lines[j]);
    }
    free(lines);
    return EXIT_SUCCESS;
}
Some notes:
Because of the realloc() on each line, the run time of your program will scale badly for files with a huge number of lines. To improve this you might like to use a better algorithm, for example growing the allocation in larger steps. But this is a completely different issue.
You don't need to free the allocated memory yourself at all if you need it until the end of the program; the operating system will reclaim the memory automatically when the process exits.
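For readers whose library lacks strdup (see the exercise mentioned above), a possible replacement built from strlen, malloc, and strcpy could look like this; the function name my_strdup is just an illustrative choice:

#include <stdlib.h>
#include <string.h>

/* Possible strdup replacement: copy s into a freshly malloc'd buffer,
 * or return NULL if the allocation fails. */
char *my_strdup(const char *s)
{
    size_t size = strlen(s) + 1;      /* room for the terminating '\0' */
    char *copy = malloc(size);
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}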

getline() / strsep() combination causes segmentation fault

I'm getting a segmentation fault when running the code below.
It should basically read a .csv file with over 3M lines and do other stuff afterwards (not relevant to the problem), but after 207746 iterations it returns a segmentation fault. If I remove the p = strsep(&line,"|"); and just print the whole line it will print the >3M lines.
int ReadCSV (int argc, char *argv[]){
    char *line = NULL, *p;
    unsigned long count = 0;
    FILE *data;

    if (argc < 2) return 1;

    if((data = fopen(argv[1], "r")) == NULL){
        printf("the CSV file cannot be open");
        exit(0);
    }

    while (getline(&line, &len, data)>0) {
        p = strsep(&line,"|");
        printf("Line number: %lu \t p: %s\n", count, p);
        count++;
    }
    free(line);
    fclose(data);
    return 0;
}
I guess it'd have to do with the memory allocation, but can't figure out how to fix it.
A combination of getline and strsep often causes confusion, because both functions modify the pointer variable whose address you pass as the first argument. If you pass the pointer that has been through strsep to getline again, you run the risk of undefined behavior on the second iteration.
Consider an example: getline allocates 101 bytes to line, and reads a 100-character string into it. Note that len is now set to 101. You call strsep, which finds '|' in the middle of the string, so it points line to what used to be line+50. Now you call getline again. It sees another 100-character line, and concludes that it is OK to copy it into the buffer, because len is still 101. However, since line points to the middle of the buffer now, writing 100 characters becomes undefined behavior.
Make a copy of the line pointer before calling strsep:
while (getline(&line, &len, data)>0) {
    char *copy = line;
    p = strsep(&copy, "|");
    printf("Line number: %lu \t p: %s\n", count, p);
    count++;
}
Now line that you pass to getline is preserved between loop iterations.
Look at the expression getline(&line, &len, data) and read the manpage:
If *line is set to NULL and *len is set 0 before the call, then
getline() will allocate a buffer for storing the line. This buffer
should be freed by the user program even if getline() failed.
This should be the case on your first time round the loop (although we can't see where len is declared, let's just assume your real code does this correctly)
Alternatively, before calling getline(), *line can contain a
pointer to a malloc(3)-allocated buffer *len bytes in size. If the
buffer is not large enough to hold the line, getline() resizes it
with realloc(3), updating *line and *len as necessary.
OK, so if line != NULL it must point to a buffer allocated by malloc of size len. The buffer allocated by your first call to getline (as above) satisfies this.
Note it's not good enough for line to point somewhere into that buffer, it must be the beginning.
Now look at the expression strsep(&line,"|") and read the manpage for that:
... This token is terminated by overwriting the delimiter with a
null byte ('\0'), and *line is updated to point past the token
So, the first argument (line) is changed so that you can call strsep again with the same first argument, and get the next token. This means line is no longer a valid argument to getline, because it isn't the start of a malloc'd buffer (and the length len is also now wrong).
In practice, either
getline will try to read len bytes into the buffer you gave it, but since you advanced line by the length of the first token, it writes off the end of your allocated block. This might just damage the heap rather than dying immediately
getline will try to realloc the buffer you gave it, but since it isn't a valid allocated block, you get heap damage again.
While we're here, you also don't check p is non-NULL, but damaging line is the main problem.
Oh, and if you think the problem is allocation-related, try using valgrind - it generally finds the moment things first go wrong.

obstack, gets and getline

I am trying to get a line from stdin. As far as I understand, we should never use gets, as the man page of gets says:
Never use gets(). Because it is impossible to tell without knowing
the data in advance how many characters gets() will read, and
because gets() will continue to store characters past the end of the
buffer, it is extremely dangerous to use. It has been used to
break computer security. Use fgets() instead.
It suggests that we can use fgets() instead. The problem with fgets() is that we don't know the size of the user input in advance, and fgets() reads at most one less than size bytes from the stream, as the man page says:
fgets() reads in at most one less than size characters from stream
and stores them into the buffer pointed to by s. Reading stops
after an EOF or a newline. If a newline is read, it is stored into
the buffer. A terminating null byte ('\0') is stored after the last
character in the buffer.
There is also another approach, the POSIX getline(), which uses realloc to grow the buffer so we can read a string of arbitrary length from the input stream, as the man page says:
Alternatively, before calling getline(), *lineptr can contain a
pointer to a malloc(3)-allocated buffer *n bytes in size. If the
buffer is not large enough to hold the line, getline() resizes it
with realloc(3), updating *lineptr and *n as necessary.
And finally there is another approach, using obstack, as the libc manual says:
Aside from this one constraint of order of freeing, obstacks are
totally general: an obstack can contain any number of objects of
any size. They are implemented with macros, so allocation is
usually very fast as long as the objects are usually small. And the
only space overhead per object is the padding needed to start each
object on a suitable boundary...
So we can use an obstack for any object of any size, allocation is very fast, and there is only a little space overhead, which is not a big deal. I wrote this code to read an input string without knowing its length.
#include <stdio.h>
#include <stdlib.h>
#include <obstack.h>

#define obstack_chunk_alloc malloc
#define obstack_chunk_free free

int main(){
    unsigned char c;
    struct obstack * mystack;

    mystack = (struct obstack *) malloc(sizeof(struct obstack));
    obstack_init(mystack);

    c = fgetc(stdin);
    while(c!='\r' && c!='\n'){
        obstack_1grow(mystack,c);
        c = fgetc(stdin);
    }
    printf("the size of the stack is: %d\n",obstack_object_size(mystack));
    printf("the input is: %s\n",(char *)obstack_finish(mystack));
    return 0;
}
So my questions are:
Is it safe to use obstack like this?
Is it like using POSIX getline?
Am I missing something here? Any drawbacks?
Why shouldn't I use it?
Thanks in advance.
fgets has no drawbacks over gets. It just forces you to acknowledge that you must know the size of the buffer. gets instead requires you to somehow magically know beforehand the length of the input a (possibly malicious) user is going to feed into your program. That is why gets was removed from the C programming language. It is now non-standard, while fgets is standard and portable.
As for knowing the length of the line beforehand, POSIX says that a utility must be prepared to handle lines that fit in a buffer of LINE_MAX bytes. Thus you can do:
char line[LINE_MAX];
while (fgets(line, LINE_MAX, fp) != NULL)
and any file that produces problems with that is not a standard text file. In practice everything will be mostly fine if you just don't blindly assume that the last character in the buffer is always '\n' (which it isn't).
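As a hedged sketch of that last point, here is one way to check whether fgets actually delivered a complete line and strip the newline when it did; the wrapper name read_all is purely illustrative:

#include <limits.h>   /* LINE_MAX */
#include <stdio.h>
#include <string.h>

void read_all(FILE *fp)   /* illustrative wrapper name */
{
    char line[LINE_MAX];

    while (fgets(line, LINE_MAX, fp) != NULL) {
        size_t len = strlen(line);
        if (len > 0 && line[len - 1] == '\n') {
            line[len - 1] = '\0';   /* complete line: strip the newline */
        } else {
            /* no newline: the line was longer than LINE_MAX - 1,
               or the file ended without one; handle as appropriate */
        }
    }
}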
getline is a POSIX standard function. obstack is a GNU libc extension that is not portable. getline was built for efficient reading of lines from files; obstack was not, it was built to be generic. With an obstack, the string is not contiguous in memory in its final place until you call obstack_finish.
Use getline if on POSIX, use fgets in programs that need to be maximally portable; look for an emulation of getline for non-POSIX platforms built on fgets.
Why shouldn't I use it?
Well, you shouldn't use getline() if you care about portability. You should use getline() if you're specifically targeting only POSIX systems.
As for obstacks, they're specific to the GNU C library, which might already be a strong reason to avoid them (it further restricts portability). Also, they're not meant to be used for this purpose.
If you aim for portability, just use fgets(). It's not too complicated to write a function similar to getline() based on fgets() -- here's an example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNKSIZE 1024

char *readline(FILE *f)
{
    size_t bufsize = CHUNKSIZE;
    char *buf = malloc(bufsize);
    if (!buf) return 0;
    buf[0] = '\0';   /* so an immediate EOF still yields a valid (empty) string */

    char *pos = buf;
    size_t len = 0;

    while (fgets(pos, CHUNKSIZE, f))
    {
        char *nl = strchr(pos, '\n');
        if (nl)
        {
            // newline found, replace with string terminator
            *nl = '\0';
            char *tmp = realloc(buf, len + strlen(pos) + 1);
            if (tmp) return tmp;
            return buf;
        }

        // no newline, increase buffer size
        len += strlen(pos);
        char *tmp = realloc(buf, len + CHUNKSIZE);
        if (!tmp)
        {
            free(buf);
            return 0;
        }
        buf = tmp;
        pos = buf + len;
    }

    // handle case when input ends without a newline
    char *tmp = realloc(buf, len + 1);
    if (tmp) return tmp;
    return buf;
}

int main(void)
{
    char *input = readline(stdin);
    if (!input)
    {
        fputs("Error reading input!\n", stderr);
        return 1;
    }
    puts(input);
    free(input);
    return 0;
}
This one removes the newline if it was found and returns a newly allocated buffer (which the caller has to free()). Adapt to your needs. It could be improved by increasing the buffer size only when the buffer was filled completely, with just a bit more code ...

C: how to read in a variable amount of info from files and store it in array

I am not used to programming in C, so I am wondering how to declare an array, read a variable number of lines from a file, and store those lines in the array.
//how do I declare an array whose size varies
do {
    char buffer[1000];
    fscanf(file, "%[^\n]\n", buffer);
    //how do I add buffer to the array
} while(!feof(file));
int nlines = 0;
char **lines = NULL;   /* Array of resulting lines */
int curline = 0;
char buffer[BUFSIZ];   /* Just allocate this once, not each time through the loop */

do {
    if (fgets(buffer, sizeof buffer, file)) {   /* fgets() is the easy way to read a line */
        if (curline >= nlines) {                /* Have we filled up the result array? */
            nlines += 1000;                     /* Increase size by 1,000 */
            lines = realloc(lines, nlines*sizeof(*lines));   /* And grow the array */
        }
        lines[curline] = strdup(buffer);        /* Make a copy of the input line and add it to the array */
        curline++;
    }
} while(!feof(file));
Arrays are always fixed-size in C. You cannot change their size. What you can do is make an estimate of how much space you'll need beforehand and allocate that space dynamically (with malloc()). If you happen to run out of space, you reallocate. See the documentation for realloc() for that. Basically, you do:
buffer = realloc(buffer, size);
The new size can be larger or smaller than what you had before (meaning you can "grow" or "shrink" the array.) So if at first you want, say, space for 5000 characters, you do:
char* buffer = malloc(5000);
If later you run out of space and want an additional 2000 characters (so the new size will be 7000), you would do:
buffer = realloc(buffer, 7000);
The already existing contents of buffer are preserved. Note that realloc() might not be able to really grow the memory block, so it might allocate an entirely new block first, then copy the contents of the old memory to the new block, and then free the old memory. That means that if you stored a copy of the buffer pointer elsewhere, it will point to the old memory block which doesn't exist anymore. For example:
char* ptr = buffer;
buffer = realloc(buffer, 7000);
At that point, ptr is only valid if ptr == buffer, which is not guaranteed to be the case.
It appears that you are trying to read until you read a newline.
The easiest way to do this is via getline.
char *buffer = NULL;
size_t buffer_len = 0;
ssize_t ret = getline(&buffer, &buffer_len, file);
...this will read one line of text from the file file (unless ret is -1, in which case there's an error or you're at the end of the file).
An array where the string data is in the array entry is usually a non-optimal choice. If the complete set of data will fit comfortably in memory and there's a reasonable upper bound on the number of entries, then a pointer-array is one choice.
But first, avoid scanf %s and %[] formats without explicit lengths. Using your example buffer size of 1000, the maximum string length that you can read is 999, so:
/* Some needed data */
int n;
struct ptrarray_t
{
    char **strings;
    int nalloc;   /* number of string pointers allocated */
    int nused;    /* number of string pointers used */
} pa_hdr;         /* presume this was initialized previously */
...
n = fscanf(file, "%999[^\n]", buffer);
if (n != 1 || getc(file) != '\n')
{
    /* there's a problem */
}
/* Now add a string to the array */
if (pa_hdr.nused < pa_hdr.nalloc)
{
    int len = strlen(buffer);
    char *cp = malloc(len+1);
    strcpy(cp, buffer);
    pa_hdr.strings[pa_hdr.nused++] = cp;
}
A reference to any string hereafter is just pa_hdr.strings[i], and a decent design will use function calls or macros to manage the header, which in turn will be in a header file and not inline. When you're done with the array, you'll need a delete function that will free all of those malloc()ed pointers.
If there are a large number of small strings, malloc() can be costly, both in time and space overhead. You might manage pools of strings in larger blocks that will live nicely with the memory allocation and paging of the host OS. Using a set of functions to effectively make an object out of this string-array will help your development. You can pick a simple strategy, as above, and optimize the implementation later.
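As a rough, untested illustration of that pooling idea (not from the answer; the block size, struct, and function name here are all made up for the sketch), many small strings can be copied into large fixed-size blocks so that a single malloc serves many lines:

#include <stdlib.h>
#include <string.h>

#define POOLBLOCK 65536                  /* one allocation serves many strings */

struct strpool {
    char *block;                         /* current block */
    size_t used;                         /* bytes used in the current block */
};

/* Copy s into the pool, starting a new block when the current one is full.
 * Returns a pointer to the pooled copy, or NULL on failure.
 * Note: this toy version does not keep a list of earlier blocks, so a real
 * implementation would need to track them in order to free everything later. */
char *pool_copy(struct strpool *p, const char *s)
{
    size_t need = strlen(s) + 1;
    if (need > POOLBLOCK)
        return NULL;                     /* too big for this simple scheme */
    if (p->block == NULL || p->used + need > POOLBLOCK) {
        p->block = malloc(POOLBLOCK);    /* start a fresh block */
        if (p->block == NULL)
            return NULL;
        p->used = 0;
    }
    char *dst = p->block + p->used;
    memcpy(dst, s, need);
    p->used += need;
    return dst;
}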
