I found a program that reads from standard input:
#include <stdio.h>

/* defined elsewhere in the program */
int rgrep_matches(const char *line, const char *pattern);

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <PATTERN>\n", argv[0]);
        return 2;
    }
    /* we're not going to worry about long lines */
    char buf[4096]; // 4 KiB
    while (!feof(stdin) && !ferror(stdin)) { // when given a file through input redirection, the file becomes stdin
        if (!fgets(buf, sizeof(buf), stdin)) { // fgets() reads at most sizeof(buf) - 1 characters into buf and stops early once a newline is read
            break;
        }
        if (rgrep_matches(buf, argv[1])) {
            fputs(buf, stdout); // writes the matching line to stdout
            fflush(stdout);
        }
    }
    if (ferror(stdin)) {
        perror(argv[0]); // prints a description of the error
        return 1;
    }
    return 0;
}
Why is buf set to 4096 elements? Is it because the maximum number of characters on each line can only be 4096?
The answer is in the code you pasted:
/* we're not going to worry about long lines */
char buf[4096]; // 4kibi
Lines longer than 4096 characters can occur, but the author didn't deem them worth caring about.
Note also the definition of fgets:
fgets() reads in at most one less than size characters from stream and stores them into the buffer pointed to by s. Reading stops after an EOF or a newline. If a newline is read, it is stored into the buffer. A terminating null byte (\0) is stored after the last character in the buffer.
So if there is a line longer than 4095 characters (the 4096th is reserved for the null byte), it will be split across multiple iterations of the while loop.
The program just reads up to 4095 characters per iteration.
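If you want to detect that split, a minimal sketch (my own illustration, not part of the original program) is to check whether fgets() stored a newline:

#include <stdio.h>
#include <string.h>

int main(void) {
    char buf[4096];
    while (fgets(buf, sizeof(buf), stdin)) {
        /* no stored newline means fgets() stopped mid-line
           (or the final line ended without one at EOF) */
        if (strchr(buf, '\n') == NULL)
            fprintf(stderr, "chunk without newline: long line split (or EOF)\n");
        fputs(buf, stdout);
    }
    return 0;
}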
There's no limit on the size of a line, but there may be a limit on the size of the stack (8 MB on modern Linux systems).
Most programmers choose whatever fits the program being implemented best; in this case the programmer commented that there's no need to worry about longer lines.
The author seems to just have a very large memory block for his expected input, to avoid dealing with chunks.
The seemingly awkward number 4096 is most likely explained by the fact that it is a) a power of two and b) the size of a memory page. So when the system chooses to swap a page out to disk, it can do it in one go without any overhead involved.
Whether this really helps is another question, because if you allocate a page with malloc, it may not be aligned on a page boundary.
I myself also use such a number often, because it doesn't hurt and in the best case it might help. However, it is only really relevant if you are worried about speed and have real control over the allocation process in detail. If you allocate a page directly from the OS, then such a size might really have some benefits.
There is no such thing as a maximum number of characters in a line. 4096 is chosen on the assumption that under normal conditions no line will be longer than 4096 bytes.
It's more like preparing for the worst case.
If you make the array smaller than the longest line, the operation is simply broken into more than one step until EOF is encountered.
I think it is simply that the author chose the char buffer size to be 4 *kibi* (4096 = 1024 * 4) by design, as commented in the code.
So I have this code:
FILE* file = fopen("file.txt", "r");
if (file == NULL)
{
    printf("Failed to open file.\n");
    return NULL;
}
fseek(file, 0L, SEEK_END);
long bufferSize = ftell(file);
fseek(file, 0L, SEEK_SET);
char* buffer = (char*) malloc(bufferSize);
if (buffer == NULL)
{
    printf("Failed to allocate memory for buffer.\n");
    return NULL;
}
fread(buffer, sizeof(char), bufferSize, file);
fclose(file);
This seems to work perfectly fine when printing to the console with printf("%s", buffer), but I am wondering if this should be causing a buffer overflow, or if it's wrong, since there seemingly isn't a null terminator character at the end.
Let's assume that file.txt has exactly 4 characters in it. When bufferSize is calculated, it will be a long with the value 4. So when I call malloc(bufferSize), I am creating a buffer with a size of 4 bytes, which does not account for a null terminator character. Everywhere I have seen examples of people reading an entire text file, they use code like this, but shouldn't this be creating a char* holding the characters from the file without an ending null terminator? Should I be allocating this buffer using malloc(bufferSize + 1) and adding a null terminator character?
This seems to work perfectly fine when printing to console with printf("%s", buffer)
Seeming to work perfectly fine is a perfect manifestation of undefined behavior.
Should I be allocating this buffer using malloc(bufferSize + 1) and adding a null terminator character?
If you wish to use the %s printf format specifier with a pointer to consecutive bytes of printable characters, those bytes need to be terminated with a zero byte. Put the other way around, the %s printf format specifier needs a zero-terminated sequence of bytes. Otherwise, undefined behavior happens.
So:
Either your input file contains a zero byte, so that %s stops outputting there,
or you need to supply the terminating zero byte yourself, to make sure that %s knows where to stop,
or you can iterate over the bytes yourself, for (...) { printf("%c", buffer[i]); }, or (assuming bufferSize is lower than INT_MAX, which it probably is) just tell printf when to stop by specifying the precision in the format specifier, like printf("%.*s", (int)bufferSize, buffer).
Otherwise, undefined behavior will happen.
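For completeness, here is a minimal corrected sketch of the asker's read-whole-file code (the function name readFile and the error handling are mine): allocate bufferSize + 1, and terminate at the number of bytes fread() actually returned:

#include <stdio.h>
#include <stdlib.h>

char *readFile(const char *path) {
    FILE *file = fopen(path, "rb");
    if (file == NULL)
        return NULL;
    fseek(file, 0L, SEEK_END);
    long bufferSize = ftell(file);
    fseek(file, 0L, SEEK_SET);
    char *buffer = malloc(bufferSize + 1); /* +1 for the terminator */
    if (buffer == NULL) {
        fclose(file);
        return NULL;
    }
    size_t n = fread(buffer, 1, (size_t)bufferSize, file);
    buffer[n] = '\0'; /* now %s knows where to stop */
    fclose(file);
    return buffer;
}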
Depending upon the size of the buffer you allocate and the size of the allocation unit your OS provides, there are often extra bytes at the end of the allocation. Which means that, depending on how you later use the memory, an exact buffer allocation may fail, or there may be spare byte(s) at the end of the allocation which your fread() would not overwrite. The result? You may test your program with files that have serendipitous sizes, but the program may fail intermittently once shipped.
Quick fix? Always allocate a bit more space at the end of your buffer - depending upon how your program interprets the bytes (char, short, int, long, long long, struct).
Note that the size of the allocation unit is less likely to save you if the string is nested in a struct, where struct elements are snuggled close together. But odd sized strings would still have spare space, depending upon compiler flags.
Note that your specific usage is finding the end of the file, and slurping the entire file into memory. Likely your OS provides memory in 16, 32, or 64 byte chunks. Which means that you have 1/16, 1/32, or 1/64 chance of accidentally strolling off the end of your allocated buffer.
Suggestions:
(0) Always allocate extra padding, to cushion running into walls.
(1) Consider using fstat() rather than ftell()?
(2) Consider memory mapping the file, rather than using malloc/free and fread.
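To illustrate suggestions (1) and (2), here is a hedged sketch (POSIX assumed; the file name and error handling are my own) that fstat()s and memory-maps the file instead of using malloc/fread:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("file.txt", O_RDONLY);
    if (fd < 0)
        return 1;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) /* fstat() instead of ftell() */
        return 1;
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED)
        return 1;
    /* the precision supplies the length, so no zero terminator is needed */
    printf("%.*s", (int)st.st_size, data);
    munmap(data, st.st_size);
    close(fd);
    return 0;
}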
My program takes in files with arbitrarily long lines. Since I don't know how many characters will be on a line, I would like to print the whole line to stdout without malloc-ing an array to store it. Is this possible?
I am aware that it's possible to print these lines one chunk at a time; however, the function doing the printing would be called very often, and I wish to avoid the overhead of malloc-ing arrays to hold the output in every single call.
First of all, you can't print something that doesn't exist, which means you have to store it somewhere, either on the stack or on the heap. If you use a FILE*, then libc will do that for you automatically.
Now if you use a FILE*, you can use getc to get one ASCII character at a time, check whether the character is a newline character, and push it to stdout.
If you're using a file descriptor, you can read a character at a time and do exactly the same thing.
Neither approach requires you to explicitly allocate memory on the heap.
Now if you use mmap, you can apply some strtok-family function and then print the string to stdout.
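A minimal sketch of the file-descriptor variant (my own illustration, assuming POSIX read()/write(); the helper name echo_fd is hypothetical): read one byte at a time and push each straight to stdout, with no heap allocation:

#include <unistd.h>

int echo_fd(int fd) {
    char c;
    while (read(fd, &c, 1) == 1) {
        /* forward each byte as it arrives; newlines pass through unchanged */
        if (write(STDOUT_FILENO, &c, 1) != 1)
            return -1;
    }
    return 0;
}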
takes in files with arbitrarily long lines ... print the whole line to stdout, without malloc-ing an array to store it. Is this possible?
In general, for arbitrary long lines: no.
A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. C11dr §7.21.2 2
The length of a line is not limited to SIZE_MAX, the longest array possible in C. The length of a line can exceed the memory capacity of the computer. There is just no way to read arbitrarily long lines. Simple code could use the following. I doubt it will be satisfactory, yet it does print the entire contents of a file with scant memory.
// Reads one character at a time.
int ch;
while ((ch = fgetc(fp)) != EOF) {
    putchar(ch);
}
Instead, code should set a sane upper bound on line length, then create an array or allocate space for the line. As useful as a flexible long line is, it is also susceptible to malicious abuse: a hacker can exploit it to consume unrestrained resources.
#define LINE_LENGTH_MAX 100000

char *line = malloc(LINE_LENGTH_MAX + 1);
if (line) {
    while (fgets(line, LINE_LENGTH_MAX + 1, fp)) {
        if (strlen(line) >= LINE_LENGTH_MAX) {
            Handle_Possible_Attack();
        }
        foo(line); // Use line
    }
    free(line);
}
I am learning about heap overflow attacks and my textbook provides the following vulnerable C code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* record type to allocate on heap */
typedef struct chunk {
    char inp[64];            /* vulnerable input buffer */
    void (*process)(char *); /* pointer to function to process inp */
} chunk_t;

void showlen(char *buf)
{
    int len;
    len = strlen(buf);
    printf("buffer5 read %d chars\n", len);
}

int main(int argc, char *argv[])
{
    chunk_t *next;
    setbuf(stdin, NULL);
    next = malloc(sizeof(chunk_t));
    next->process = showlen;
    printf("Enter value: ");
    gets(next->inp);
    next->process(next->inp);
    printf("buffer5 done\n");
}
However, the textbook doesn't explain how one would fix this vulnerability. If anyone could explain the vulnerability and a way (or ways) to fix it, that would be great. (Part of the problem is that I am coming from Java, not C.)
The problem is that gets() will keep reading into the buffer until it reads a newline or reaches EOF. It doesn't know the size of the buffer, so it doesn't know that it should stop when it hits its limit. If the line is 64 bytes or longer, this will go outside the buffer, and overwrite process. If the user entering the input knows about this, he can type just the right characters at position 64 to replace the function pointer with a pointer to some other function that he wants to make the program call instead.
The fix is to use a function other than gets(), so you can specify a limit on the amount of input that will be read. Instead of
gets(next->inp);
you can use:
fgets(next->inp, sizeof(next->inp), stdin);
The second argument to fgets() tells it to write at most 64 bytes into next->inp. So it will read at most 63 bytes from stdin (it needs to allow a byte for the null string terminator).
The code uses gets, which is infamous for its potential security problem: there's no way to specify the length of the buffer you pass to it, it'll just keep reading from stdin until it encounters \n or EOF. It may therefore overflow your buffer and write to memory outside of it, and then bad things will happen - it could crash, it could keep running, it could start playing porn.
To fix this, you should use fgets instead.
If you fill next with more than 64 bytes, you will overwrite the address stored in process, thereby enabling one to insert whatever address one wishes. The address could be a pointer to any function.
To fix it, simply ensure that only 63 bytes (plus one for the null) are read into the array inp: use fgets.
The function gets does not limit the amount of text that comes from stdin. If more than 63 chars come from stdin, there will be an overflow.
gets discards the LF char (that would be the [Enter] key), but it adds a null char at the end, hence the 63-char limit.
If inp is filled with 64 non-null chars (as it can be accessed directly), the showlen function will trigger an access violation, because strlen will search for the null char beyond inp to determine the size.
Using fgets would be a good fix to the first problem, but it will also store the LF char along with the null, so the new limit of readable text would be 62.
For the second, just take care with what is written into inp.
I have a problem which takes 1000000 lines of input like the ones below from the console.
0 1 23 4 5
1 3 5 2 56
12 2 3 33 5
...
...
I have used scanf, but it is very, very slow. Is there any way to get the input from the console faster? I could use read(), but I am not sure about the number of bytes in each line, so I cannot ask read() to read n bytes.
Thanks,
Very obliged
Use fgets(...) to pull in a line at a time. Note that you should check for the '\n' at the end of the line, and if there is not one, you are either at EOF, or you need to read another buffer's worth, and concatenate the two together. Lather, rinse, repeat. Don't get caught with a buffer overflow.
THEN, you can parse each logical line in memory yourself. I like to use strspn(...) and strcspn(...) for this sort of thing, but your mileage may vary.
Parsing:
Define a delimiters string. Use strspn() to count "non data" chars that match the delimiters, and skip over them. Use strcspn() to count the "data" chars that DO NOT match the delimiters. If this count is 0, you are done (no more data in the line). Otherwise, copy out those N chars to hand to a parsing function such as atoi(...) or sscanf(...). Then, reset your pointer base to the end of this chunk and repeat the skip-delims, copy-data, convert-to-numeric process.
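A sketch of that skip-delims / copy-data / convert loop (the field size and function name are my own assumptions, not from the answer):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void parse_line(const char *line) {
    const char *delims = " \t\n";
    const char *p = line;
    char field[32];

    for (;;) {
        p += strspn(p, delims);        /* skip "non data" chars matching the delimiters */
        size_t n = strcspn(p, delims); /* count "data" chars that do NOT match */
        if (n == 0)
            break;                     /* done: no more data in the line */
        if (n >= sizeof field)
            n = sizeof field - 1;
        memcpy(field, p, n);           /* copy out those N chars... */
        field[n] = '\0';
        printf("%d\n", atoi(field));   /* ...and hand them to a parsing function */
        p += n;                        /* reset the pointer base past this chunk */
    }
}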
If your example is representative, that you indeed have a fixed format of five decimal numbers per line, I'd probably use a combination of fgets() to read the lines, then a loop calling strtol() to convert from string to integer.
That should be faster than scanf(), while still clearer and more high-level than doing the string to integer conversion on your own.
Something like this:
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int number[5];
} LineOfNumbers;

int getNumbers(FILE *in, LineOfNumbers *line)
{
    char buf[128]; /* Should be large enough. */

    if (fgets(buf, sizeof buf, in) != NULL)
    {
        int i;
        char *ptr, *eptr;

        ptr = buf;
        for (i = 0; i < sizeof line->number / sizeof *line->number; i++)
        {
            line->number[i] = (int) strtol(ptr, &eptr, 10);
            if (eptr == ptr)
                return 0;
            ptr = eptr;
        }
        return 1;
    }
    return 0;
}
Note: this is untested (even uncompiled!) browser-written code. But perhaps useful as a concrete example.
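As a hedged usage sketch (mine, not from the answer), one might drive it from stdin like this:

LineOfNumbers line;
while (getNumbers(stdin, &line)) {
    printf("%d %d %d %d %d\n",
           line.number[0], line.number[1], line.number[2],
           line.number[3], line.number[4]);
}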
Use multiple reads with a fixed-size buffer until you hit end of file.
Out of curiosity, what generates that many lines that fast in a console?
Use binary I/O if you can. Text conversion can slow down the reading by several times. If you're using text I/O because it's easy to debug, consider again binary format, and use the od program (assuming you're on unix) to make it human-readable when needed.
Oh, another thing: there's AT&T's SFIO library, which stands for safer/faster file IO. You might also have some luck with that, but I doubt that you'll get the same kind of speedup as you will with binary format.
Read a line at a time (if buffer not big enough for a line, expand and continue with larger buffer).
Then use dedicated functions (e.g. atoi) rather than general for conversion.
But, most of all, set up a repeatable test harness with profiling to ensure changes really do speed things up.
fread will still return (with a short count) if you try to read more bytes than there are.
I have found one of the fastest ways to read a file is like this:
/* seek to the end of the file */
fseek(file, 0, SEEK_END);
/* get the size of the file */
size = ftell(file);
/* seek back to the start of the file */
fseek(file, 0, SEEK_SET);
/* make a 1 MB buffer for the file */
buffer = malloc(1048576);
/* fread in 1 MB at a time until size bytes have been read */
long remaining = size;
while (remaining > 0) {
    size_t want = remaining > 1048576 ? 1048576 : (size_t)remaining;
    size_t got = fread(buffer, 1, want, file);
    if (got == 0)
        break;
    /* ... process the got bytes in buffer ... */
    remaining -= (long)got;
}
On modern computers, put your RAM to use and load the whole thing into RAM; then you can easily work your way through the memory.
At the very least you should be using fread with block sizes as big as you can, and at least as big as the cache blocks or HDD sector size (4096 bytes minimum; I would use 1048576 as a minimum personally). You will find that with much bigger read requests, fread is able to sequentially get a big stream in one operation. The suggestion some people make here to use 128 bytes is ridiculous: you will end up with the drive having to seek all the time, as the tiny delay between calls will cause the head to already be past the next sector, which almost certainly holds sequential data that you want.
You can greatly reduce the time of execution by taking input using fread() or fread_unlocked() (if your program is single-threaded). Locking/Unlocking the input stream just once takes negligible time, so ignore that.
Here is the code:
#include <cstdio>
#include <cctype>

const int maxio = 1000000;
char buf[maxio], *s = buf + maxio;

inline char getc1(void)
{
    /* refill the buffer once all of it has been consumed */
    if (s >= buf + maxio) {
        fread_unlocked(buf, sizeof(char), maxio, stdin);
        s = buf;
    }
    return *(s++);
}

inline int input()
{
    char t = getc1();
    int n = 1, res = 0;

    /* skip anything that is neither a sign nor a digit */
    while (t != '-' && !isdigit(t))
        t = getc1();
    if (t == '-') {
        n = -1;
        t = getc1();
    }
    while (isdigit(t)) {
        res = 10 * res + (t & 15); /* t & 15 maps an ASCII digit to its value */
        t = getc1();
    }
    return res * n;
}
This is implemented in C++ (fread_unlocked is a glibc extension). In plain C, include <stdio.h> and <ctype.h> instead of <cstdio> and <cctype>.
You can take input as a stream of chars by calling getc1() and take integer input by calling input().
The whole idea behind using fread() is to take all the input at once. Calling scanf()/printf() repeatedly takes up valuable time in locking and unlocking streams, which is completely redundant in a single-threaded program.
Also make sure that the value of maxio is such that all input can be taken in a few "roundtrips" only (ideally one, in this case). Tweak it as necessary.
Hope this helps!
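Hypothetical usage for the question's input format (five integers per line, 1,000,000 lines), assuming the definitions above:

int i, j, numbers[5];
for (i = 0; i < 1000000; i++) {
    for (j = 0; j < 5; j++)
        numbers[j] = input(); /* one integer at a time via the buffered reader */
    /* ... process numbers[0..4] ... */
}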
I want to read line by line from a given input file, process each line (i.e. its words), and then move on to the next line...
So I am using fscanf(fptr, "%s", words) to read the word, and it should stop once it encounters the end of the line...
but this is not possible with fscanf, I guess... so please tell me what to do...
I should read all the words in the given line (i.e. until the end of line is encountered), then terminate and move on to the next line, and repeat the same process.
Use fgets(). Yeah, the link is to cplusplus, but the function originates from C's stdio.h.
You may also use sscanf() to read words from a string, or just strtok() to separate them.
In response to a comment: this behavior of fgets() (leaving \n in the string) allows you to determine whether the actual end of line was encountered. Note that fgets() may also read only part of a line from the file if the supplied buffer is not large enough. In your case, just check for \n at the end and remove it if you don't need it. Something like this:
// actually you'll get str contents from fgets()
char str[MAX_LEN] = "hello there\n";
size_t len = strlen(str);

if (len && str[len-1] == '\n') {
    str[len-1] = 0;
}
Simple as that.
If you are working on a system with the GNU extensions available, there is something called getline (man 3 getline) which allows you to read a file on a line-by-line basis, and getline will allocate extra memory for you if needed. The manpage contains an example, which I modified to split the line using strtok (man 3 strtok).
#define _GNU_SOURCE /* for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *fp;
    char *line = NULL;
    size_t len = 0;
    ssize_t read;

    fp = fopen("/etc/motd", "r");
    if (fp == NULL)
    {
        printf("File open failed\n");
        return 0;
    }

    while ((read = getline(&line, &len, fp)) != -1) {
        // At this point we have a line held within 'line'
        printf("Line: %s", line);

        const char *delim = " \n";
        char *ptr;

        ptr = strtok(line, delim);
        while (ptr != NULL)
        {
            printf("Word: %s\n", ptr);
            ptr = strtok(NULL, delim);
        }
    }

    if (line)
    {
        free(line);
    }
    return 0;
}
Given the buffering inherent in all the stdio functions, I would be tempted to read the stream character by character with getc(). A simple finite state machine can identify word boundaries, and line boundaries if needed. An advantage is the complete lack of buffers to overflow, aside from whatever buffer you collect the current word in if your further processing requires it.
You might want to do a quick benchmark comparing the time required to read a large file completely with getc() vs. fgets()...
If an outside constraint requires that the file really be read a line at a time (for instance, if you need to handle line-oriented input from a tty) then fgets() probably is your friend as other answers point out, but even then the getc() approach may be acceptable as long as the input stream is running in line-buffered mode which is common for stdin if stdin is on a tty.
Edit: To have control over the buffer on the input stream, you might need to call setbuf() or setvbuf() to force it to a buffered mode. If the input stream ends up unbuffered, then using an explicit buffer of some form will always be faster than getc() on a raw stream.
Best performance would probably use a buffer related to your disk I/O, at least two disk blocks in size and probably a lot more than that. Often, even that performance can be beat by arranging the input to be a memory mapped file and relying on the kernel's paging to read and fill the buffer as you process the file as if it were one giant string.
Regardless of the choice, if performance is going to matter then you will want to benchmark several approaches and pick the one that works best in your platform. And even then, the simplest expression of your problem may still be the best overall answer if it gets written, debugged and used.
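For what it's worth, a minimal sketch of such a finite state machine (my own illustration, not from the answer): two states, inside or outside a word, driven by getc():

#include <ctype.h>
#include <stdio.h>

int main(void) {
    int ch, in_word = 0, words = 0, lines = 0;

    while ((ch = getc(stdin)) != EOF) {
        if (ch == '\n')
            lines++;           /* line boundary */
        if (isspace(ch))
            in_word = 0;       /* word boundary */
        else if (!in_word) {
            in_word = 1;       /* crossing into a new word */
            words++;
        }
    }
    printf("%d words, %d lines\n", words, lines);
    return 0;
}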
but this is not possible in fscanf,
It is, with a bit of wickedness ;)
Update: More clarification on evilness
but unfortunately a bit wrong. I assume [^\n]%*[^\n] should read [^\n]%*. Moreover, one should note that this approach will strip whitespaces from the lines. – dragonfly
Note that "%" xstr(MAXLINE) "[^\n]" (i.e. %255[^\n] after macro expansion) reads at most MAXLINE characters, which can be anything except the newline character (i.e. \n). The second part of the specifier, i.e. %*[^\n], rejects anything (that's why the * character is there) if the line has more than MAXLINE characters, up to but NOT including the newline character. The newline character tells scanf to stop matching. What if we did as dragonfly suggested? The only problem is scanf would not know where to stop and would keep suppressing assignment until the next newline is hit (which is another match for the first part). Hence you would trail by one line of input when reporting.
What if you wanted to read in a loop? A little modification is required. We need to add a getchar() to consume the unmatched newline. Here's the code:
#include <stdio.h>

#define MAXLINE 255

/* stringify macros: these work only in pairs, so keep both */
#define str(x) #x
#define xstr(x) str(x)

int main() {
    char line[ MAXLINE + 1 ];

    /*
      Wickedness explained: we read from `stdin` to `line`.
      The format specifier is the only tricky part: We don't
      bite off more than we can chew -- hence the specification
      of maximum number of chars i.e. MAXLINE. However, this
      width has to go into a string, so we stringify it using
      macros. The careful reader will observe that once we have
      read MAXLINE characters we discard the rest upto and
      including a newline.
    */
    int n = fscanf(stdin, "%" xstr(MAXLINE) "[^\n]%*[^\n]", line);
    if (!feof(stdin)) {
        getchar();
    }

    while (n == 1) {
        printf("[line:] %s\n", line);
        n = fscanf(stdin, "%" xstr(MAXLINE) "[^\n]%*[^\n]", line);
        if (!feof(stdin)) {
            getchar();
        }
    }
    return 0;
}