Reading a file to a char array, then malloc'ing the size (C)

Hey, so let's say I get a file as the first command line argument:
int main(int argc, char** argv) {
    unsigned char* fileArray;
    FILE* file1 = fopen(argv[1], "r");
}
Now how can I go about reading that file, char by char, into fileArray?
Basically, how can I convert a FILE* to a char* before I know how big the char* needs to be malloc'ed?
I know a possible solution is to use a buffer, but my problem is that I'm dealing with files that could have over 900000 chars, and a fixed buffer that large doesn't seem like a good fit.

If only "real" files (not stream, devices, ...) are used, you can use stat/fstat or something like
int retval=fseek(file1,0,SEEK_END); // succeeded if ==0 (file seekable, etc.)
long size=ftell(file1); // size==-1 would be error
rewind(file1);
to get the file's size beforehand. Then you can malloc and read.
But since file1 might change in the meantime you still have to ensure not to read beyond your malloced size.
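Putting those pieces together, a minimal sketch of that approach could look like this (opening in binary mode "rb" is my choice here, so the size ftell reports matches the bytes fread delivers; error handling is kept short):
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    FILE *file1 = fopen(argv[1], "rb");
    if (!file1) return 1;

    if (fseek(file1, 0, SEEK_END) != 0) { fclose(file1); return 1; }
    long size = ftell(file1);            /* size == -1 would be an error */
    if (size < 0) { fclose(file1); return 1; }
    rewind(file1);

    unsigned char *fileArray = malloc((size_t)size);
    if (!fileArray) { fclose(file1); return 1; }

    /* fread may return less than size if the file shrank in the meantime */
    size_t got = fread(fileArray, 1, (size_t)size, file1);

    /* ... work with fileArray[0 .. got-1] ... */

    free(fileArray);
    fclose(file1);
    return 0;
}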

There are a couple of approaches you can take:
1. Specify a maximum size that you can handle, so you allocate only once (whether statically or on the heap).
2. Handle the file in chunks if you're worried about fitting it all into memory at once.
3. Handle an arbitrary size by using malloc with realloc (growing the buffer as you read chunks in).
Number 1 is easy:
static char buff[900001]; // or malloc/free of the same size
size_t count = fread(buff, 1, 900001, fIn);
if (count > 900000) // problem: the file is bigger than the maximum!
Number 2 is probably the best way to do it unless you absolutely need the whole file in memory at once. For example, if your program counts the number of words, it can sequentially process the file a few K at a time.
Number 3, you can maintain a buffer, used and max variable. Initially set max to 50K and allocate buffer as that size.
Then try read in one 10K chunk to a fixed buffer tbuff. Add up the current used and the number of bytes read into tbuff and, if that's greater than max, do a realloc to increase buffer by another 50K (adjusting max at the same time).
Then append tbuff to buffer, adjust used, rinse and repeat. Note that all those values (10K, 50K and so on) are examples only. There are different values you can use depending on your needs.
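A sketch of number 3, using the example values from above (the function name read_all and the CHUNK/GROW constants are mine, purely for illustration):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (10 * 1024)   /* read 10K at a time */
#define GROW  (50 * 1024)   /* grow the buffer 50K at a time */

/* Read all of fIn into a growing buffer; *out_used receives the number
   of bytes stored. Returns NULL on allocation failure. */
unsigned char *read_all(FILE *fIn, size_t *out_used)
{
    size_t used = 0, max = GROW;
    unsigned char *buffer = malloc(max);
    unsigned char tbuff[CHUNK];
    size_t n;

    if (!buffer) return NULL;

    while ((n = fread(tbuff, 1, sizeof tbuff, fIn)) > 0) {
        if (used + n > max) {                        /* buffer full: grow it */
            unsigned char *tmp = realloc(buffer, max + GROW);
            if (!tmp) { free(buffer); return NULL; }
            buffer = tmp;
            max += GROW;
        }
        memcpy(buffer + used, tbuff, n);             /* append tbuff, adjust used */
        used += n;
    }
    *out_used = used;
    return buffer;
}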

Related

Will assigning a large value for length of char string be an issue?

I am reading a line from a file and I do not know how long it is going to be. I know there are ways to do this with pointers, but I am specifically asking about a plain char array. For example, if I declare the string like this:
char string[300]; //(or bigger)
Will having large string values like this be a problem?
Any hard coded number is potentially too small to read the contents of a file. It's best to compute the size at run time, allocate memory for the contents, and then read the contents.
See Read file contents with unknown size.
char string[300]; //(or bigger)
I am not sure which of the two issues you are concerned with, so I will try to address both below:
If the string in the file is larger than 300 bytes and you try to stick it into that buffer without accounting for the maximum length of your array, you will get undefined behaviour from writing past the end of the array.
If you are just asking whether 300 bytes is too much to allocate, then no, it is not a big deal unless you are on some very restricted device. For example, in Visual Studio the default stack size (where that array would be stored) is 1 MB, if I am not mistaken. The benefit of doing so is understandable: you don't need to concern yourself with freeing it, etc.
PS: if you are sure the buffer size you specify is enough, this can be a fine approach, as it frees you from the memory management issues that come with pointers and dynamic memory.
Will having large string values like this be a problem?
Absolutely.
If your application must read the entire line from a file before processing it, then you have two options.
1) Allocate a buffer large enough to hold a line of the maximum allowed length. For example, the SMTP protocol does not allow lines longer than 998 characters, so you can allocate a static buffer of length 1001 (998 + \r + \n + \0). Once you have read a line from a file (or from a client, in this example's context) which is longer than the maximum length (that is, you have read 1000 characters and the last one is not \n), you can treat it as a fatal (protocol) error and report it.
2) If there is no limit on the length of the input line, the only thing you can do to ensure your program's robustness is to allocate buffers dynamically as the input is read. This may involve storing multiple malloc'ed buffers in a linked list, or calling realloc when buffer exhaustion is detected (this is how the getline function works, although it is specified only in POSIX.1-2008, not in the C standard).
In either case, never use gets to read the line. Call fgets instead.
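As a sketch of option 2 with fgets and realloc (the function name read_line is mine, not a standard API; it grows the buffer by doubling rather than keeping a linked list):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of arbitrary length; the caller frees the result.
   Returns NULL on EOF with nothing read, or on allocation failure. */
char *read_line(FILE *fp)
{
    size_t cap = 128, len = 0;
    char *line = malloc(cap);
    if (!line) return NULL;

    while (fgets(line + len, (int)(cap - len), fp)) {
        len += strlen(line + len);
        if (len > 0 && line[len - 1] == '\n')
            return line;                  /* got the whole line */
        cap *= 2;                         /* line continues: grow and keep reading */
        char *tmp = realloc(line, cap);
        if (!tmp) { free(line); return NULL; }
        line = tmp;
    }
    if (len > 0) return line;             /* last line without a trailing newline */
    free(line);
    return NULL;
}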
It all depends on how you read the line. For example:
char string[300];
FILE* fp = fopen(filename, "r");
//Error checking omitted
fgets(string, 300, fp);
Taken from tutorialspoint.com
The C library function char *fgets(char *str, int n, FILE *stream) reads a line from the specified stream and stores it into the string pointed to by str. It stops when either (n-1) characters are read, the newline character is read, or the end-of-file is reached, whichever comes first.
That means this will read at most 299 characters from the file. The worst case is a logical error (you might not get all the data you need); it won't cause any undefined behavior.
But, if you do:
char string[300];
int i = 0;
FILE* fp = fopen(filename, "r");
do {
    string[i] = fgetc(fp);
    i++;
} while (string[i - 1] != '\n');
This will invoke undefined behaviour (typically a segmentation fault), because on lines longer than 300 characters it will write past the end of the array.

How to save a specific length string from a file and work with it in C

So what I'm trying to do is open a file and read it until the end in blocks that are 256 bytes long each time it is called. My dilemma is using fgets() or fread() to do it.
I was using fgets() initially, because it returns a string of the bytes that were read, which is great because I can store that data and work with it. However, in the particular file I'm reading, a 256-byte block often spans more than two lines, which is a problem because fgets() stops reading when it hits a newline character or the end of the file.
I then thought of using fread(), but I don't know how to save the line I'm referring to with it, because fread() only returns the number of elements successfully read (according to its documentation).
I've searched and thought of solutions for a while now and can't find anything that works with my particular scenario. I would like some guidance on how to go about this issue, how would you go about this in my position?
You can use fread() to read each 256-byte block and keep a lineCount variable to track the number of newline characters you have encountered so far in the input. Since you have to process the blocks anyway, this adds little overhead.
To read a block of 256 chars, which is what I think you are doing, you just need to create a buffer of chars that can hold 256 of them, in other words a char array of size 256.
#define BLOCK_SIZE 256
char block[BLOCK_SIZE];
Then if you check the documentation for fread() it shows the following signature:
Following is the declaration for fread() function.
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream)
Parameters
ptr -- This is the pointer to a block of memory with a minimum size of size*nmemb bytes.
size -- This is the size in bytes of each element to be read.
nmemb -- This is the number of elements, each one with a size of size bytes.
stream -- This is the pointer to a FILE object that specifies an input stream.
So this means it takes a pointer to the buffer where it will write the read information, the size of each element it's supposed to read, the maximum amount of elements you want it to read and the file pointer. In your case it would be:
size_t read = fread(block, sizeof(char), BLOCK_SIZE, file);
This will copy the information from the file to the block array, which you can later process and keep track of the lines. The characters that were read by fread are in the block array, so the first char in the last read block would be block[0], the second block[1] and so on. The returned value in read indicates how many elements (in your case chars) were inserted in the array block when you call fread, this number will be equal to BLOCK_SIZE for every call, unless you reach the end of the file or there's an error.
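Putting that together, one possible sketch of the block reading with a lineCount (the file name here is a placeholder):
#include <stdio.h>

#define BLOCK_SIZE 256

int main(void)
{
    FILE *file = fopen("input.dat", "rb");   /* placeholder file name */
    if (!file) return 1;

    char block[BLOCK_SIZE];
    size_t read;
    long lineCount = 0;

    while ((read = fread(block, sizeof(char), BLOCK_SIZE, file)) > 0) {
        for (size_t i = 0; i < read; i++)
            if (block[i] == '\n')
                lineCount++;                 /* track how many lines we have passed */
        /* ... process block[0 .. read-1] here ... */
    }

    printf("newlines seen: %ld\n", lineCount);
    fclose(file);
    return 0;
}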
I suggest you read some documentation for a full example, play a little with the code and do some reading on pointers in C to gain a better understanding of how everything works in general. If you still have questions after that, we can take it from there or you can create a new SO question.

Effective methods for reading and writing large files in C

I'm writing an application that deals with very large user-generated input files. The program will copy about 95 percent of the file, effectively duplicating it and switching a few words and values in the copy, and then appending the copy (in chunks) to the original file, such that each block (consisting of between 10 and 50 lines) in the original is followed by the copied and modified block, and then the next original block, and so on. The user-generated input conforms to a certain format, and it is highly unlikely that any line in the original file is longer than 100 characters.
Which would be the better approach?
To use one file pointer and use variables that hold the current position of how much has been read and where to write to, seeking the file pointer back and forth to read and write; or
To use multiple file pointers, one for reading and one for writing.
I am mostly concerned with the efficiency of the program, as the input files will reach up to 25,000 lines, each about 50 characters long.
If you have memory constraints, or you want a generic approach, read bytes into a buffer from one file pointer, make changes, and write out the buffer to a second file pointer when the buffer is full. If you reach EOF on the first pointer, make your changes and just flush whatever is in the buffer to the output pointer. If you intend to replace the original file, copy the output file to the input file and remove the output file. This "atomic" approach lets you check that the copy operation took place correctly before deleting anything.
For example, to deal with generically copying over any number of bytes, say, 1 MiB at a time:
#define COPY_BUFFER_MAXSIZE 1048576
/* ... */
unsigned char *buffer = NULL;
buffer = malloc(COPY_BUFFER_MAXSIZE);
if (!buffer)
    exit(-1);
FILE *inFp = fopen(inFilename, "r");
fseek(inFp, 0, SEEK_END);
uint64_t fileSize = ftell(inFp);
rewind(inFp);
FILE *outFp = stdout; /* change this if you don't want to write to standard output */
uint64_t outFileSizeCounter = fileSize;
/* we fread() bytes from inFp in COPY_BUFFER_MAXSIZE increments, until there is nothing left to fread() */
do {
    if (outFileSizeCounter > COPY_BUFFER_MAXSIZE) {
        fread(buffer, 1, (size_t) COPY_BUFFER_MAXSIZE, inFp);
        /* -- make changes to buffer contents at this stage
           -- if you resize the buffer, then copy the buffer and
              change the following statement to fwrite() the number of
              bytes in the copy of the buffer */
        fwrite(buffer, 1, (size_t) COPY_BUFFER_MAXSIZE, outFp);
        outFileSizeCounter -= COPY_BUFFER_MAXSIZE;
    }
    else {
        fread(buffer, 1, (size_t) outFileSizeCounter, inFp);
        /* -- make changes to buffer contents at this stage
           -- again, make a copy of buffer if it needs resizing,
              and adjust the fwrite() statement to change the number
              of bytes that need writing */
        fwrite(buffer, 1, (size_t) outFileSizeCounter, outFp);
        outFileSizeCounter = 0ULL;
    }
} while (outFileSizeCounter > 0);
free(buffer);
An efficient way to deal with a resized buffer is to keep a second pointer, say, unsigned char *copyBuffer, which is realloc()-ed to twice the size, if necessary, to deal with accumulated edits. That way, you keep expensive realloc() calls to a minimum.
Not sure why this got downvoted, but it's a pretty solid approach for doing things with a generic amount of data. Hope this helps someone who comes across this question, in any case.
25000 lines * 100 characters = 2.5MB, that's not really a huge file. The fastest will probably be to read the whole file in memory and write your results to a new file and replace the original with that.

C using fread to read an unknown amount of data

I have a text file called test.txt
Inside it will be a single number, it may be any of the following:
1
2391
32131231
3123121412
I.e. it could be any size of number, from 1 digit up to x digits.
The file will only have 1 thing in it - this number.
I want a bit of code using fread() which will read that number of bytes from the file and put it into an appropriately sized variable.
This is to run on an embedded device; I am concerned about memory usage.
How to solve this problem?
You can simply use:
char buffer[4096];
size_t nbytes = fread(buffer, sizeof(char), sizeof(buffer), fp);
if (nbytes == 0)
...EOF or other error...
else
...process nbytes of data...
Or, in other words, provide yourself with a data space big enough for any valid data and then record how much data was actually read into the string. Note that the string will not be null terminated unless either buffer contained all zeroes before the fread() or the file contained a zero byte. You cannot rely on a local variable being zeroed before use.
It is not clear how you want to create the 'appropriately sized variable'. You might end up using dynamic memory allocation (malloc()) to provide the correct amount of space, and then return that allocated pointer from the function. Remember to check for a null return (out of memory) before using it.
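For instance, one way to turn the buffer into that appropriately sized variable is to copy out exactly nbytes (a sketch; the function name read_chunk is mine, and the terminating '\0' is added so the result can be used as a string):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read up to 4095 bytes from fp and return an exact-size, null-terminated
   copy; returns NULL on EOF/error or out of memory. The caller frees it. */
char *read_chunk(FILE *fp)
{
    char buffer[4096];
    size_t nbytes = fread(buffer, sizeof(char), sizeof(buffer) - 1, fp);
    if (nbytes == 0)
        return NULL;                  /* EOF or other error */

    char *result = malloc(nbytes + 1);
    if (!result)
        return NULL;                  /* out of memory */
    memcpy(result, buffer, nbytes);
    result[nbytes] = '\0';
    return result;
}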
If you want to avoid over-reading, fread is not the right function. You probably want fscanf with a conversion specifier along the lines of %100[0123456789]...
One way to achieve this is to use fseek to move your file stream location to the end of the file:
fseek(file, 0, SEEK_END);
and then use ftell to get the position of the cursor in the file; this returns the offset in bytes, so you can use that value to allocate a suitably large buffer and then read the file into it.
I have seen warnings saying this may not always be 100% accurate, but I've used it in several instances without a problem; I think the issues depend on specific implementations of the functions on certain platforms (in particular, for a text-mode stream the value ftell returns is not guaranteed to be a simple byte count).
Depending on how clever you need to be with the number conversion... If you do not need to be especially clever and fast, you can read it a character at a time with getc(). So,
- Start with a variable initialized to 0.
- Read a character, multiply the variable by 10, and add the new digit.
- Repeat until done.
Grow into a bigger variable as needed along the way, or start with your largest type and copy the result into the smallest type that fits once you finish.
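A sketch of that loop (it assumes the file holds only decimal digits and that the value fits in an unsigned long long; overflow checking is omitted):
#include <ctype.h>
#include <stdio.h>

/* Accumulate a decimal number one character at a time. */
unsigned long long read_number(FILE *fp)
{
    unsigned long long value = 0;     /* start with a variable initialized to 0 */
    int c;
    while ((c = getc(fp)) != EOF && isdigit(c))
        value = value * 10 + (unsigned)(c - '0');   /* multiply by 10, add digit */
    return value;
}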

Getting the input strings into an array in C

In C, we do something like:
int main(int argc, char **argv) {
    printf("The first argument is %s", argv[1]);
    printf("The second argument is %s", argv[2]);
    return 0;
}
I was wondering if it's possible to store strings in an array, in a similar way to the above, when using scanf or fgets.
I tried like:
char **input;
scanf("%s", &input);
Is there any way I can access the strings entered as input[0], input[1], and so on?
Yes, but you need to make sure you have enough space to do so:
char input[3][50]; // enough space for 3 strings with
// a length of 50 (including \0)
fgets(input[0], 50, stdin);
printf("Inputted string: %s\n", input[0]);
A bare char **input has no space allocated for the input, so you cannot read into it like that.
It's possible, but somewhat tedious, especially if you don't know the number of strings at the beginning.
char **input;
That much is fine. From there, you need allocate an array of (the right number of) pointers:
input = malloc(sizeof(char *) * MAX_LINES);
Then you need to allocate space for each line. Since you typically only want enough space for each string, you typically do something like this:
#define MAX_LINE_LEN 8192
static char buffer[MAX_LINE_LEN];
long current_line = 0;
while (fgets(buffer, sizeof(buffer), infile) && current_line < MAX_LINES) {
    input[current_line] = malloc(strlen(buffer) + 1);
    strcpy(input[current_line++], buffer);
}
If you don't know the number of lines up-front, you typically allocate a number of pointers to start with (about as above), but as you read each line, check whether you've exceeded the current allocation, and if you have realloc the array of pointers to get more space.
If you want to badly enough, you can do the same with each individual line. Above, I've simply set a maximum that's large enough you probably won't exceed it very often with most typical text files. If you need it larger, it's pretty easy to expand that. At the same time, any number you pick will be an arbitrary limit. If you want to, you can read a chunk into your buffer, and if the last character in the string is not a new-line, keep reading more into the same string (and, again, use realloc to expand the allocation as needed). This isn't terribly difficult to do, but covering all the corner cases correctly can/does get tedious.
Edit: I should add that there's a rather different way to get the same basic effect. Read the entire content of the file into a single big buffer, then (typically) use strtok to break the buffer into lines (replacing "\n" with "\0") to build an array of pointers into the buffer. This typically improves speed somewhat (one big read instead of many one-line reads) as well as allocation overhead because you use one big allocation instead of many small ones. Each allocation will typically have a header, and get rounded to something like a (multiple of some) power of two. The effect of this varies with the line length involved. If you have a few long lines, it probably won't matter much. If you have a lot of short lines, it can save a lot.
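A sketch of that read-everything-then-strtok approach (sizes and error handling are simplified, and note that strtok skips empty lines because it treats runs of '\n' as one separator):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINES 10000

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    FILE *infile = fopen(argv[1], "rb");
    if (!infile) return 1;

    /* read the entire file into a single big buffer */
    fseek(infile, 0, SEEK_END);
    long size = ftell(infile);
    rewind(infile);

    char *big = malloc((size_t)size + 1);
    if (!big) { fclose(infile); return 1; }
    size_t got = fread(big, 1, (size_t)size, infile);
    big[got] = '\0';
    fclose(infile);

    /* build an array of pointers into the buffer, one per line */
    char **input = malloc(sizeof(char *) * MAX_LINES);
    if (!input) { free(big); return 1; }
    long current_line = 0;

    for (char *p = strtok(big, "\n");
         p && current_line < MAX_LINES;
         p = strtok(NULL, "\n"))
        input[current_line++] = p;        /* strtok replaced the '\n' with '\0' */

    printf("read %ld lines\n", current_line);
    free(input);
    free(big);
    return 0;
}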
