Reading a binary file 1 byte at a time

Reading a binary file 1 byte at a time - c

I am trying to read a binary file in C 1 byte at a time and after searching the internet for hours I still can not get it to retrieve anything but garbage and/or a seg fault. Basically the binary file is in the format of a list that is 256 items long and each item is 1 byte (an unsigned int between 0 and 255). I am trying to use fseek and fread to jump to the "index" within the binary file and retrieve that value. The code that I have currently:
unsigned int buffer;
int index = 3; // any index value
size_t indexOffset = 256 * index;
fseek(file, indexOffset, SEEK_SET);
fread(&buffer, 256, 1, file);
printf("%d\n", buffer);
Right now this code is giving me random garbage numbers and seg faulting. Any tips as to how I can get this to work right?

Your confusing bytes with int. The common term for a byte is an unsigned char. Most bytes are 8-bits wide. If the data you are reading is 8 bits, you will need to read in 8 bits:
#define BUFFER_SIZE 256
unsigned char buffer[BUFFER_SIZE];
/* Read in 256 8-bit numbers into the buffer */
size_t bytes_read = 0;
bytes_read = fread(buffer, sizeof(unsigned char), BUFFER_SIZE, file_ptr);
// Note: sizeof(unsigned char) is for emphasis
The reason for reading all the data into memory is to keep the I/O flowing. There is an overhead associated with each input request, regardless of the quantity requested. Reading one byte at a time, or seeking to one position at a time is the worst case.
Here is an example of the overhead required for reading 1 byte:
Tell OS to read from the file.
OS searches to find the file location.
OS tells disk drive to power up.
OS waits for disk drive to get up to speed.
OS tells disk drive to position to the correct track and sector.
-->OS tells disk to read one byte and put into drive buffer.
OS fetches data from drive buffer.
Disk spins down to a stop.
OS returns 1 byte to your program.
In your program design, the above steps will be repeated 256 times. With everybody's suggestion, the line marked with "-->" will read 256 bytes. Thus the overhead is executed only once instead of 256 times to get the same quantity of data.

In your code you are trying to read 256 bytes to the address of one int. If you want to read one byte at a time, call fread(&buffer, 1, 1, file); (See fread).
But a simpler solution will be to declare an array of bytes, read it all together and process it after that.

unsigned char buffer; // note: 1 byte
fread(&buffer, 1, 1, file);
It is time to read mans I believe.

Couple of problems with the code as it stands.
The prototype for fread is:
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
You've set the size to 256 (bytes) and the count to 1. That's fine, that means "read one lump of 256 bytes, shove it into the buffer".
However, your buffer is on the order of 2-8 bytes long (or, at least, vastly smaller than 256 bytes), so you have a buffer overrun. You probably want to use fred(&buffer, 1, 1, file).
Furthermore, you're writing byte data to an int pointer. This will work on one endian-ness (small-endian, in fact), so you'll be fine on Intel architecture and from that learn bad habits tha WILL come back and bite you, one of these days.
Try real hard to only write byte data into byte-organised storage, rather than into ints or floats.

You are trying to read 256 bytes into a 4-byte integer variable called "buffer". You are overwriting the next 252 bytes of other data.
It seems like buffer should either be unsigned char buffer[256]; or you should be doing fread(&buffer, 1, 1, f) and in that case buffer should be unsigned char buffer;.
Alternatively, if you just want a single character, you could just leave buffer as int (unsigned is not needed because C99 guarantees a reasonable minimum range for plain int) and simply say:
buffer = fgetc(f);

Related

Using fread() to read a text based file - best practices

Consider this code to read a text based file. This sort of fread() usage was briefly touched upon in the excellent book C Programming: A Modern Approach by K.N. King.
There are other methods of reading text based files, but here I am concerned with fread() only.
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
// Declare file stream pointer.
FILE *fp = fopen("Note.txt", "r");
// fopen() call successful.
if(fp != NULL)
{
// Navigate through to end of the file.
fseek(fp, 0, SEEK_END);
// Calculate the total bytes navigated.
long filesize = ftell(fp);
// Navigate to the beginning of the file so
// it can be read.
rewind(fp);
// Declare array of char with appropriate size.
char content[filesize + 1];
// Set last char of array to contain NULL char.
content[filesize] = '\0';
// Read the file content.
fread(content, filesize, 1, fp);
// Close file stream pointer.
fclose(fp);
// Print file content.
printf("%s\n", content);
}
// fopen() call unsuccessful.
else
{
printf("File could not be read.\n");
}
return 0;
}
There are some problems I have with this method. My opinion is that this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?
To circumvent this issue, we may use a buffer size and keep on reading into a char array of that size. If filesize is less than buffer size, then we simply perform fread() once as described in the above code. Otherwise, We divide the total file size by the buffer size and get a result, whose int portion we will use as the total number of times to iterate a loop where we will invoke fread() each time, appending the read buffer array into a larger string. Now, for the final fread(), which we will perform after the loop, we will have to read exactly (filesize % buffersize) bytes of data into an array of that size and finally append this array into the larger string (Which we would have malloc-ed with filesize + 1 beforehand). I find that if we perform fread() for the last chunk of data using buffersize as its second parameter, then extra garbage data of size (buffersize - chunksize) will be read in and the data might become corrupted. Are my assumptions here correct? Please explain if/ how I have overlooked something.
Also, there is the issue that non-ASCII characters might not have size of 1 byte. In that case I would assume the proper amount is being read, but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?

this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?
fread() does not care about strings (null character terminated arrays). It reads data as if it was in multiples of unsigned char*1 with no special concern to the data content if the stream opened in binary mode and perhaps some data processing (e.g. end-of-line, byte-order-mark) in text mode.
Are my assumptions here correct?
Failed assumptions:
Assuming ftell() return value equals the sum of fread() bytes.
The assumption can be false in text mode (as OP opened the file) and fseek() to the end is technical undefined behavior in binary mode.
Assuming not checking the return value of fread() is OK. Use the return value of fread() to know if an error occurred, end-of-file and how many multiples of bytes were read.
Assuming error checking is not required. , ftell(), fread(), fseek() instead of rewind() all deserve error checks. In particular, ftell() readily fails on streams that have no certain end.
Assuming no null characters are read. A text file is not certainly made into one string by reading all and appending a null character. Robust code detects and/or copes with embedded null characters.
Multi-byte: assuming input meets the encoding requirements. Example: robust code detects (and rejects) invalid UTF8 sequences - perhaps after reading the entire file.
Extreme: Assuming a file length <= LONG_MAX, the max value returned from ftell(). Files may be larger.
but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?
fread() does not function on multi-byte boundaries, only multiples of unsigned char. A given fread() may end with a portion of a multi-byte and the next fread() will continue from mid-multi-byte.
Instead of of 2 pass approach consider 1 single pass
// Pseudo code
total_read = 0
Allocate buffer, say 4096
forever
if buffer full
double buffer_size (`realloc()`)
u = unused portion of buffer
fread u bytes into unused portion of buffer
total_read += number_just_read
if (number_just_read < u)
quit loop
Resize buffer total_read (+ 1 if appending a '\0')
Alternatively consider the need to read the entire file in before processing the data. I do not know the higher level goal, but often processing data as it arrives makes for less resource impact and faster throughput.
Advanced
Text files may be simple ASCII only, 8-bit code page defined, one of various UTF encodings (byte-order-mark, etc. The last line may or may not end with a '\n'. Robust text processing beyond simple ASCII is non-trivial.
ASCII and UTF-8 are the most common. IMO, handle 1 or both of those and error out on anything that does not meet their requirements.
*1 fread() reads in multiple of bytes as per the 3rd argument, which is 1 in OP's case.
// v --- multiple of 1 byte
fread(content, filesize, 1, fp);

How to read a binary into an array

Say I have a 90 megabyte file. It's not encrypted, but it is binary.
I want to store this file into a table as an array of byte values so I can process the file byte by byte.
I can spare up to 2 GB of ram, so something with a thing like jotting down what bytes have been processed, which bytes have yet to be processed, and the processed bytes, would all be good. I don't exactly care about how long it may take to process.
How should I approach this?

Note I've expanded and rewritten this answer due to Egor's comment.
You first need the file open in binary mode. The distinction is important on Windows, where the default text mode will change line endings from CR+LF into C newlines. You do this by specifying a mode argument to io.open of "rb".
Although you can read a file one byte at a time, in practice you will want to work through the file in buffers. Those buffers can be fairly large, but unless you know you are handling only small files in a one-off script, you should avoid reading the entire file into a buffer with file:read"*a" since that will cause various problems with very large files.
Once you have a file open in binary mode, you read a chunk of it using buffer = file:read(n), where n is an integer count of bytes in the chunk. Using a moderately sized power of two will likely be the most efficient. The return value will either be nil, or will be a string of up to n bytes. If less than n bytes long, that was the last buffer in the file. (If reading from a socket, pipe, or terminal, however, reads less than n may only indicate that no data has arrived yet, depending on lots of other factors to complex to explain in this sentence.)
The string in buffer can be processed any number of ways. As long as #buffer is not too big, then {buffer:byte(1,-1)} will return an array of integer byte values for each byte in the buffer. Too big partly depends on how your copy of Lua was configured when it was built, and may depend on other factors such as available memory as well. #buffer > 1E6 is certainly too big. In the example that follows, I used buffer:byte(i) to access each byte one at a time. That works for any size of buffer, at least as long as i remains an integer.
Finally, don't forget to close the file.
Here's a complete example, lightly tested. It reads a file a buffer at a time, and accumulates the total size and the sum of all bytes. It then prints the size, sum, and average byte value.
-- sum all bytes in a file
local name = ...
assert(name, "Usage: "..arg[0].." filename")
file = assert(io.open(name, "rb"))
local sum, len = 0,0
repeat
local buffer = file:read(1024)
if buffer then
len = len + #buffer
for i = 1, #buffer do
sum = sum + buffer:byte(i)
end
end
until not buffer
file:close()
print("length:",len)
print("sum:",sum)
print("mean:", sum / len)
Run with Lua 5.1.4 on my Windows box using the example as its input, it reports:
length: 402
sum: 30374
mean: 75.557213930348

To split the contents of a string s into an array of bytes use {s:byte(1,-1)}.

How to save a specific length string from a file and work with it in C

So what I'm trying to do is open a file and read it until the end in blocks that are 256 bytes long each time it is called. My dilemma is using fgets() or fread() to do it.
I was using fgets() initially, because it returns a string of the bytes that were read, which is great because I can store that data and work with it. However, in my particular file that I'm reading, the 256 bytes often happen over a more than 2 lines, which is a problem because fgets() stops reading when it hits a newline character or the end of the file.
I then thought of using fread(), but I don't know how to save the line that I'm referring to with it because fread() returns an int referring to the number of elements successfully read (according to its documentation).
I've searched and thought of solutions for a while now and can't find anything that works with my particular scenario. I would like some guidance on how to go about this issue, how would you go about this in my position?

You can use fread() to read each 256 bytes block and keep a lineCount variable to keep track of the number of new line characters you have encountered so far in the input. Since you have to process the blocks already this wouldn't mean much of an overhead in the processing.
To read a block of 256 chars, which is what I think you are doing, you just need to create a buffer of chars that can hold 256 of them, in other words a char array of size 256.
#define BLOCK_SIZE 256
char block[BLOCK_SIZE];
Then if you check the documentation for fread() it shows the following signature:
Following is the declaration for fread() function.
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream)
Parameters
ptr -- This is the pointer to a block of memory with a minimum size of size*nmemb bytes.
size -- This is the size in bytes of each element to be read.
nmemb -- This is the number of elements, each one with a size of size bytes.
stream -- This is the pointer to a FILE object that an input stream.
So this means it takes a pointer to the buffer where it will write the read information, the size of each element it's supposed to read, the maximum amount of elements you want it to read and the file pointer. In your case it would be:
int read = fread(block, sizeof(char), BLOCK_SIZE, file);
This will copy the information from the file to the block array, which you can later process and keep track of the lines. The characters that were read by fread are in the block array, so the first char in the last read block would be block[0], the second block[1] and so on. The returned value in read indicates how many elements (in your case chars) were inserted in the array block when you call fread, this number will be equal to BLOCK_SIZE for every call, unless you reach the end of the file or there's an error.
I suggest you read some documentation for a full example, play a little with the code and do some reading on pointers in C to gain a better understanding of how everything works in general. If you still have questions after that, we can take it from there or you can create a new SO question.

fread, fwrite for big size video file (about 180MB)

I want to read a video file and save as binary and write as a video file again.
I tested with 180MB video. I used fread function and It occur segmentation fault because array size is small for video.
those are my questions:
I use 160*1024 bytes char array. What is the maximum size of char array? How I can solve this problem?
this program need to work as:
read 128 bytes of video -> Encrypt -> write 128 byte
read next 128 bytes -> Encrypt -> write to the next.
I can't upload my code because of security rule of company. Any tip would be appreciated.

first use fseek() with SEEK_END, then use ftell() to determine the file size, after that allocate the needed memory with malloc() and write the data to that memory.
If I understand you correctly you don't need to allocate so much memory, but only 128 Bytes.
char buf[128];
while(/* condition */)
{
ret = fread(buf, sizeof buf, 1, fp_in);
encrypt(buf);
ret = fwrite(buf, sizeof buf, 1, fp_out);
}

How does fread really work?

The declaration of fread is as following:
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
The question is: Is there a difference in reading performance of two such calls to fread:
char a[1000];
fread(a, 1, 1000, stdin);
fread(a, 1000, 1, stdin);
Will it read 1000 bytes at once each time?

There may or may not be any difference in performance. There is a difference in semantics.
fread(a, 1, 1000, stdin);
attempts to read 1000 data elements, each of which is 1 byte long.
fread(a, 1000, 1, stdin);
attempts to read 1 data element which is 1000 bytes long.
They're different because fread() returns the number of data elements it was able to read, not the number of bytes. If it reaches end-of-file (or an error condition) before reading the full 1000 bytes, the first version has to indicate exactly how many bytes it read; the second just fails and returns 0.
In practice, it's probably just going to call a lower-level function that attempts to read 1000 bytes and indicates how many bytes it actually read. For larger reads, it might make multiple lower-level calls. The computation of the value to be returned by fread() is different, but the expense of the calculation is trivial.
There may be a difference if the implementation can tell, before attempting to read the data, that there isn't enough data to read. For example, if you're reading from a 900-byte file, the first version will read all 900 bytes and return 900, while the second might not bother to read anything. In both cases, the file position indicator is advanced by the number of characters successfully read, i.e., 900.
But in general, you should probably choose how to call it based on what information you need from it. Read a single data element if a partial read is no better than not reading anything at all. Read in smaller chunks if partial reads are useful.

According to the specification, the two may be treated differently by the implementation.
If your file is less than 1000 bytes, fread(a, 1, 1000, stdin) (read 1000 elements of 1 byte each) will still copy all the bytes until EOF. On the other hand, the result of fread(a, 1000, 1, stdin) (read 1 1000-byte element) stored in a is unspecified, because there is not enough data to finish reading the 'first' (and only) 1000 byte element.
Of course, some implementations may still copy the 'partial' element into as many bytes as needed.

That would be implementation detail. In glibc, the two are identical in performance, as it's implemented basically as (Ref http://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofread.c):
size_t fread (void* buf, size_t size, size_t count, FILE* f)
{
size_t bytes_requested = size * count;
size_t bytes_read = read(f->fd, buf, bytes_requested);
return bytes_read / size;
}
Note that the C and POSIX standard does not guarantee a complete object of size size need to be read every time. If a complete object cannot be read (e.g. stdin only has 999 bytes but you've requested size == 1000), the file will be left in an interdeterminate state (C99 §7.19.8.1/2).
Edit: See the other answers about POSIX.

fread calls getc internally. in Minix number of times getc is called is simply size*nmemb so how many times getc will be called depends on the product of these two. So Both fread(a, 1, 1000, stdin) and fread(a, 1000, 1, stdin) will run getc 1000=(1000*1) Times.
Here is the siimple implementation of fread from Minix
size_t fread(void *ptr, size_t size, size_t nmemb, register FILE *stream){
register char *cp = ptr;
register int c;
size_t ndone = 0;
register size_t s;
if (size)
while ( ndone < nmemb ) {
s = size;
do {
if ((c = getc(stream)) != EOF)
*cp++ = c;
else
return ndone;
} while (--s);
ndone++;
}
return ndone;
}

There may be no performance difference, but those calls are not the same.
fread returns the number of elements read, so those calls will return different values.
If an element cannot be completely read, its value is indeterminate:
If an error occurs, the resulting value of the file position indicator for the stream is
indeterminate. If a partial element is read, its value is indeterminate. (ISO/IEC 9899:TC2 7.19.8.1)
There's not much difference in the glibc implementation, which just multiplies the element size by the number of elements to determine how many bytes to read and divides the amount read by the member size in the end. But the version specifying an element size of 1 will always tell you the correct number of bytes read. However, if you only care about completely read elements of a certain size, using the other form saves you from doing a division.

One more sentence form http://pubs.opengroup.org/onlinepubs/000095399/functions/fread.html is notable
The fread() function shall read into the array pointed to by ptr up to nitems elements whose size is specified by size in bytes, from the stream pointed to by stream. For each object, size calls shall be made to the fgetc() function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object.
Inshort in both case data will be accessed by fgetc()...!

I wanted to clarify the answers here. fread performs buffered IO. The actual read block sizes fread uses are determined by the C implementation being used.
All modern C libraries will have the same performance with the two calls:
fread(a, 1, 1000, file);
fread(a, 1000, 1, file);
Even something like:
for (int i=0; i<1000; i++)
a[i] = fgetc(file)
Should result in the same disk access patterns, although fgetc would be slower due to more calls into the standard c libraries and in some cases the need for a disk to perform additional seeks which would have otherwise been optimized away.
Getting back to the difference between the two forms of fread. The former returns the actual number of bytes read. The latter returns 0 if the file size is less than 1000, otherwise it returns 1. In both cases the buffer would be filled with the same data, i.e. the contents of the file up to 1000 bytes.
In general, you probably want to keep the 2nd parameter (size) set to 1 such that you get the number of bytes read.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight