Read file byte by byte using read() - c

I am trying to wrap my head around the read() system call.
How can I read an actual file byte by byte using read()?
The first parameter is the file descriptor which is of type int.
How can I pass a file to the read() call?

You open the file with open(); you pass the file descriptor returned by open() to read().
int fd;
if ((fd = open(filename, O_RDONLY)) >= 0)
{
    char c;
    /* read() returns 1 while a byte was read, 0 at EOF, -1 on error */
    while (read(fd, &c, 1) == 1)
        putchar(c);
    close(fd);
}
There are other functions that return file descriptors: creat(), pipe(), socket(), accept(), etc.
Note that while this works, it is inefficient because it makes a lot of system calls: one per byte. Normally, you read many bytes at a time to cut down on the number of system calls. The standard I/O library (in <stdio.h>) handles this automatically. If you use the low-level open(), read(), write(), close() system calls, you have to handle buffering yourself.

The last argument to read() is the number of bytes to read from the file, so passing 1 reads a single byte. Before that, you use open() to get a file descriptor, something like this (untested code):
int fh = open("filename", O_RDONLY);
char buffer[1];
read(fh, buffer, 1);
However, it's usually not recommended to read files byte by byte, as it affects performance significantly. Instead, you should buffer your input and process it in chunks, like so:
int fh = open("filename", O_RDONLY);
char buffer[BUFFER_SIZE];
ssize_t n = read(fh, buffer, BUFFER_SIZE);  // may return fewer bytes than requested
for (ssize_t i = 0; i < n; ++i) {
    // process the byte at buffer[i]
}
You would finally wrap your reads in a loop until EOF is reached.
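The loop mentioned above could look roughly like the following sketch. The function name count_bytes and the 4096-byte BUFFER_SIZE are assumptions made for illustration; the per-byte "processing" here only counts bytes.

```c
#include <fcntl.h>
#include <unistd.h>

#define BUFFER_SIZE 4096  /* assumed chunk size, not a recommendation from the answer */

/* Read a file in BUFFER_SIZE chunks until EOF.
   Returns the number of bytes seen, or -1 on error. */
long count_bytes(const char *path)
{
    char buffer[BUFFER_SIZE];
    long total = 0;
    ssize_t n;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* read() returns the number of bytes actually read,
       0 at end of file, and -1 on error */
    while ((n = read(fd, buffer, sizeof buffer)) > 0) {
        for (ssize_t i = 0; i < n; ++i) {
            /* process buffer[i] here; this sketch only counts */
            ++total;
        }
    }

    close(fd);
    return (n < 0) ? -1 : total;
}
```

Only the first n bytes of the buffer are valid after each call, which is why the inner loop runs to n rather than to BUFFER_SIZE.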

The concept of the read() system call, in plain English, is this:
read(from this file (file descriptor), into this buffer in memory, this many bytes)
Example: read one character at a time from a file on disk into the buffer BUFF:
int fd; // the file descriptor
fd = open("file_name", O_RDONLY); // open the file in read-only mode
char BUFF;
read(fd, &BUFF, sizeof(char)); // read one byte (sizeof(char) is 1) from fd into the memory at &BUFF

Related

Read/write program in C that copies a file, and I think it's taking too long to copy

The first program uses a preset buffer. When I set the buffer to 512 bytes and copy a 100 MB file, it takes about 1 second, but with a 1-byte buffer it takes over 3 minutes to copy the same file. My other program uses the fread and fwrite functions: it is about 0.5 seconds faster with the 512-byte buffer, but with a 1-byte buffer it takes only about 13 seconds to copy the 100 MB file. Can someone see any error in the code that uses system calls (read, write, open)?
1. Code that uses (read, write...)
int main(int argc, char* argv[])
{
char sourceName[20], destName[20], bufferStr[20];
int f1, f2, fRead;
int bufferSize = 0;
char* buffer;
bufferSize = atoi(argv[3]);
buffer = (char*)calloc(bufferSize, sizeof(char));
strcpy(sourceName, argv[1]);
f1 = open(sourceName, O_RDONLY);
if (f1 == -1)
printf("something's wrong with oppening source file!\n");
else
printf("file opened!\n");
strcpy(destName, argv[2]);
f2 = open(destName, O_CREAT | O_WRONLY | O_TRUNC | O_APPEND, 0644); // O_CREAT requires a mode argument
if (f2 == -1)
printf("something's wrong with oppening destination file!\n");
else
printf("file2 opened!");
fRead = read(f1, buffer, sizeof(char));
while (fRead != 0)
{
write(f2, buffer, sizeof(char));
fRead = read(f1, buffer, sizeof(char));
}
close(f1);
close(f2);
return 0;
}
2. Code that uses(fread, fwrite...)
int main(int argc, char* argv[]) {
FILE* fsource, * fdestination;
char sourceName[20], destinationName[20], bufferSize[20];
//scanf("%s %s %s", sourceName, destinationName, bufferSize);
strcpy(sourceName, argv[1]);
strcpy(destinationName, argv[2]);
strcpy(bufferSize, argv[3]);
int bSize = atoi(bufferSize);
printf("bSize = %d\n", bSize);
fsource = fopen(sourceName, "r");
if (fsource == NULL)
printf("read file did not open\n");
else
printf("read file opened sucessfully!\n");
fdestination = fopen(destinationName, "w");
if (fdestination == NULL)
printf("write file did not open\n");
else
printf("write file opened sucessfully!\n");
char *buffer = (char*)calloc(bSize, sizeof(char));
int flag;
printf("size of buffer: %d", bSize);
while (0 < (flag = fread(buffer, sizeof(char), bSize, fsource)))
fwrite(buffer, sizeof(char), flag, fdestination); // write only the bytes fread() actually returned
fclose(fsource);
fclose(fdestination);
return 0;
}
EDIT:
These are my measurements for the buffers.
I took 20 measurements for each buffer and each file: malaDat (1 byte), srednjaDar (100 MB), velikaDat (1 GB).
Side note: sizeof(char) is always 1 by definition. So, just don't use sizeof(char)--it's frowned upon. And, I think it's adding to your confusion.
Because your example using read/write is using sizeof(char) as the count (the 3rd argument), it is only transferring one byte on each loop iteration (i.e. very slow).
At a guess, I think you're confusing the count for read/write with the size argument to fread/fwrite.
What you want is:
while (1) {
fRead = read(f1, buffer, bufferSize);
if (fRead <= 0)
break;
write(f2, buffer, fRead);
}
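A slightly more defensive version of this loop is sketched below. It also handles the case where write() accepts fewer bytes than requested. The names write_all and copy_fd, and the 64 KiB buffer, are illustrative assumptions and not part of the original code.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write exactly len bytes, retrying after short writes and EINTR.
   Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Copy everything from in_fd to out_fd through a 64 KiB buffer.
   Returns the number of bytes copied, or -1 on error. */
long copy_fd(int in_fd, int out_fd)
{
    static char buffer[64 * 1024];
    long total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buffer, sizeof buffer);
        if (n == 0)
            break;                      /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (write_all(out_fd, buffer, (size_t)n) < 0)
            return -1;
        total += n;
    }
    return total;
}
```

With a 100 MB file this issues on the order of 1,600 read calls instead of 100 million.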
Also, fread/fwrite will [probably] pick an optimal buffer size [possibly 4096]. You can use setvbuf to change that.
I've found [from some documentation somewhere] that for most filesystems [under linux, at least] the optimal transfer size is 64KB (i.e. 64 * 1024).
So, try setting bufferSize = 64 * 1024.
IIRC, there is an ioctl/syscall that can return the value of the optimal size, but I forget what it is.
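One candidate for the call being half-remembered here is stat()/fstat(): POSIX struct stat carries an st_blksize field described as the preferred block size for I/O on that file. Whether that is the call the answer had in mind is an assumption. The helper name preferred_io_size is made up for this sketch.

```c
#include <sys/stat.h>

/* Return the preferred I/O block size the filesystem reports for path,
   or -1 if stat() fails. */
long preferred_io_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) < 0)
        return -1;
    return (long)st.st_blksize;
}
```

The reported value is often 4096 on Linux filesystems, so it is a hint about block granularity rather than proof that larger buffers such as 64 KB will not be faster.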
UPDATE:
OK, but when I choose a 1-byte buffer it is still too slow: it takes more than 40 minutes to copy.
Of course, a 1 byte transfer [buffer] size will produce horrible results.
And when I use fread and fwrite with the same buffer it takes way less time, about 3 minutes. Why is that?
How big is the file (i.e. what are the respective transfer rates [in MB/sec])? I'll assume your system can transfer at 10 MB/sec [conservative--30 MB/sec [minimum] for a reasonably recent system]. So, this is 600 MB/min, and at 3 minutes the file is approx 1.8 GB?
When you specify a 1 byte transfer/buffer size to read/write they do exactly what you tell them to do. Transfer 1 byte. So, you'll do ~2 billion read syscalls and 2 billion write syscalls!!!
syscalls are generally slow.
stdio streams have an internal buffer. This is set to an optimal size, let's say: 4096.
fread [and fwrite] will fill/drain that buffer by calling [internally] read/write with a count of 4096.
So, fread/fwrite are doing 4096 times fewer syscalls. So, only about 470,000 syscalls. Quite a reduction.
The transfer to your buffer from the internal buffer is done a byte at a time, but that is only a short/fast memcpy operation done totally within the userspace application. This is much faster than issuing a syscall on a per byte basis.
So, the transfer size you pass to fread/fwrite does not affect the size of its internal buffer.
fread/fwrite only issue a read/write syscall to replenish/drain the stream's internal buffer when it is empty/full [respectively], regardless of what length you give in your fread/fwrite call.
If you want to slow down fread/fwrite, look at the man page for setbuf et al. and do:
setbuffer(fsource,NULL,1);
setbuffer(fdestination,NULL,1);
UPDATE #2:
So this is totally normal? I am asking because it is my university task to take these measurements, and my colleagues are getting results for system calls about 2 minutes slower than user calls, while I get much slower results.
If you check with them, I'll bet they're using a larger buffer size.
Remember that your original read/write code had a bug that would only transfer a byte at a time, regardless of what you set bufferSize to [from the command line].
That's why I changed the loop in my original post.
To overachieve ...
Look at O_DIRECT for open. If you use posix_memalign instead of malloc, you can force the buffer to be aligned to a multiple of the page size (4096), which O_DIRECT needs in order to work. Also set a buffer size that is a multiple of the page size.
This option bypasses the copy that the read/write syscalls otherwise perform between your userspace buffer and the kernel's internal page/filesystem cache, and has the DMA H/W transfer directly to/from your buffer.
Also, consider adding O_NOATIME
Also, there is a Linux syscall that is specifically designed to bypass all userspace memory/buffering and have the kernel do a file-to-file copy. It is sendfile; it is similar to memcpy but uses file descriptors, an offset and a length.
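A minimal sketch of a sendfile-based copy follows. It is Linux-specific and assumes a kernel where the output descriptor may be a regular file. The function name sendfile_copy and the 0644 mode are illustrative choices.

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy src to dst inside the kernel with sendfile().
   Returns the number of bytes copied, or -1 on error. */
long sendfile_copy(const char *src, const char *dst)
{
    struct stat st;
    off_t offset = 0;
    int failed = 0;

    int in_fd = open(src, O_RDONLY);
    if (in_fd < 0)
        return -1;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }
    int out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out_fd < 0) {
        close(in_fd);
        return -1;
    }

    /* sendfile() may move fewer bytes than asked for;
       it advances offset by the amount it transferred */
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            failed = 1;
            break;
        }
        if (n == 0)
            break;                      /* source ended early */
    }

    close(in_fd);
    close(out_fd);
    return failed ? -1 : (long)offset;
}
```

No userspace buffer appears anywhere in this version, which is the point of the syscall.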
And, the fastest way to access file data is often to use mmap. See my answers:
How does mmap improve file reading speed?
read line by line in the most efficient way *platform specific*
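As a rough illustration of the mmap approach, the sketch below maps a whole file read-only and scans it in memory. The function name count_lines_mmap is an assumption for this example, and counting newlines stands in for whatever processing is actually needed.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count newline characters in a file by mapping it into memory.
   Returns the count, or -1 on error. */
long count_lines_mmap(const char *path)
{
    struct stat st;
    long lines = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {              /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (data == MAP_FAILED)
        return -1;

    for (off_t i = 0; i < st.st_size; ++i)
        if (data[i] == '\n')
            ++lines;

    munmap((void *)data, (size_t)st.st_size);
    return lines;
}
```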

Write to the same file with different processes in order of occurrence

I am working on a UNIX based operating system (Lubuntu 14.10). I have several processes that need to print a message to the same file and to the std output.
When I print my message to the screen, it works the way I want, in the order of occurrence. E.g:
Process1_message1
Process2_message1
Process3_message1
Process1_message2
Process2_message2
Process3_message2
...
However, when I check the output file it is like below:
Process1_message1
Process1_message2
Process2_message1
Process2_message2
Process3_message1
Process3_message2
...
I use fprintf(FILE *ptr, char *str) to write the message to the file.
Note: I opened the file with following format in the main process:
fptr=fopen("output.txt", "a");
where fptr is a global FILE *.
Any help will be appreciated. Thank you!
fprintf() isn't going to work. It's prone to being translated into multiple calls to write(), or into a write() that happens much later than the fprintf() call, which gives exactly the output you posted. You call fprintf() once, and under the covers stdio decides when and in how many pieces the data is actually written to the file.
You need to use open( filename, O_WRONLY | O_CREAT | O_APPEND, 0600 ), and write the data something like this to ensure you call write() only once per message; with O_APPEND, each such write() is appended atomically:
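#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
ssize_t myprintf( int fd, const char *fmt, ... )
{
    char buffer[ 1024 ];
    ssize_t bytesWritten;
    va_list argp;
    va_start( argp, fmt );
    int bytes = vsnprintf( buffer, sizeof( buffer ), fmt, argp );
    va_end( argp );
    if ( bytes < 0 )
    {
        return( -1 );
    }
    if ( ( size_t ) bytes < sizeof( buffer ) )
    {
        bytesWritten = write( fd, buffer, bytes );
    }
    // buffer was too small, get a bigger one
    else
    {
        char *bufptr = malloc( bytes + 1 );
        if ( bufptr == NULL )
        {
            return( -1 );
        }
        // a va_list cannot be reused after vsnprintf(), so start it again
        va_start( argp, fmt );
        bytes = vsnprintf( bufptr, bytes + 1, fmt, argp );
        va_end( argp );
        bytesWritten = write( fd, bufptr, bytes );
        free( bufptr );
    }
    return( bytesWritten );
}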
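To try the one-write()-per-message idea with several processes, something like the following sketch could be used. The function names, the message format, and the 0600 mode are assumptions for illustration. It relies on O_APPEND behaviour for a regular file on a local filesystem.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork nproc children; each appends nmsg lines using one write() per line.
   Returns the number of children that exited successfully, or -1. */
int append_from_children(const char *path, int nproc, int nmsg)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    for (int p = 0; p < nproc; ++p) {
        pid_t pid = fork();
        if (pid < 0) {
            close(fd);
            return -1;
        }
        if (pid == 0) {                 /* child: inherits the open descriptor */
            for (int m = 0; m < nmsg; ++m) {
                char line[64];
                int len = snprintf(line, sizeof line,
                                   "Process%d_message%d\n", p + 1, m + 1);
                if (write(fd, line, (size_t)len) != len)
                    _exit(1);
            }
            _exit(0);
        }
    }

    int ok = 0, status;
    while (wait(&status) > 0)           /* reap every child */
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ++ok;
    close(fd);
    return ok;
}

/* Count lines that begin with "Process" and end with a newline,
   i.e. lines that were not torn apart by another writer. */
int count_intact_lines(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char line[128];
    int n = 0;
    while (fgets(line, sizeof line, f) != NULL)
        if (strncmp(line, "Process", 7) == 0 && strchr(line, '\n') != NULL)
            ++n;
    fclose(f);
    return n;
}
```

Each line should arrive whole. The order in which different processes' lines appear still depends on when each process actually runs.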
Most likely, your problem is that the file output is fully buffered, so the output from each process doesn't appear until the standard I/O buffer for the stream (in that process) is full.
You can probably work around it sufficiently by setting line buffering:
FILE *fptr = fopen("output.txt", "a");
if (fptr != 0)
{
setvbuf(fptr, 0, _IOLBF, BUFSIZ);
…code using fptr — including your fork() calls…
fclose(fptr);
}
Every time a process writes a line to the buffer, it will be flushed. You might run into problems if your output lines are longer than BUFSIZ; then you might want to increase the size passed to setvbuf() to the largest line length you need written atomically.
If that still isn't good enough, or if you need to be able to write groups of lines at one time, you'll have to go to a solution using file descriptors as in Andrew Henle's answer. You might want to look at the O_SYNC and O_DSYNC options to open().
stdio flushes its buffers differently when you are writing to a terminal (isatty(fileno(fptr)) returns true; see isatty(3)) than when you write to a file. For a file, stdio only issues a write(2) system call when the buffer fills up, so each process's messages appear together: they accumulate in that process's output buffer and are written out in one go when the buffer is flushed at exit. On ttys, output is flushed when the buffer fills up or when a \n character is written to the buffer (a compromise between buffering and not buffering).
You can force buffer flushing with fflush(fptr); after fprintf(fptr, ...); or even call fflush(NULL); (which flushes all output buffers in one call).
But be careful: it is the write() calls that determine atomicity (not the fprintf calls), so if you have to write several pages of output in one fprintf call, be ready to accept garbled output.
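The effect of fflush() can be observed with a small experiment along these lines. The helper names are invented for this sketch. It assumes the usual default of full buffering for a stream opened on a regular file.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Size of the file as the kernel currently sees it, or -1 on error. */
static long file_size(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? (long)st.st_size : -1;
}

/* Write one line through stdio and record the on-disk size
   just before and just after fflush(). Returns 0 on success. */
int sizes_around_fflush(const char *path, long *before, long *after)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    fprintf(fp, "hello\n");     /* normally stays in the stdio buffer */
    *before = file_size(path);  /* typically 0: nothing written yet */

    fflush(fp);                 /* forces the write(2) system call */
    *after = file_size(path);   /* now the 6 bytes are in the file */

    fclose(fp);
    return 0;
}
```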

dup() and cache flush

I am a C beginner trying to use dup(). I wrote a program to test this function, and the result is a little different from what I expected.
Code:
// unistd.h, dup() test
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
extern void dup_test();
int main() {
dup_test();
}
// dup()test
void dup_test() {
// open a file
FILE *f = fopen("/tmp/a.txt", "w+");
int fd = fileno(f);
printf("original file descriptor:\t%d\n",fd);
// duplicate file descriptor of an opened file,
int fd_dup = dup(fd);
printf("duplicated file descriptor:\t%d\n",fd_dup);
FILE *f_dup = fdopen(fd_dup, "w+");
// write to file, use the duplicated file descriptor,
fputs("hello\n", f_dup);
fflush(f_dup);
// close duplicated file descriptor,
fclose(f_dup);
close(fd_dup);
// allocate memory
int maxSize = 1024; // 1 kb
char *buf = malloc(maxSize);
// move to beginning of file,
rewind(f);
// read from file, use the original file descriptor,
fgets(buf, maxSize, f);
printf("%s", buf);
// close original file descriptor,
fclose(f);
// free memory
free(buf);
}
The program tries to write via the duplicated fd, then closes the duplicated fd, then tries to read via the original fd.
I expected that closing the duplicated fd would flush the I/O cache automatically, but it doesn't: if I remove the fflush() call from the code, the original fd can't read the content written through the duplicated fd, which is already closed.
My question is:
Does this mean that closing the duplicated fd doesn't flush automatically?
#Edit:
I am sorry, my mistake. I found the reason: my initial program had:
close(fd_dup);
but did not have:
fclose(f_dup);
After using fclose(f_dup); in place of close(fd_dup); it works.
So the duplicated fd is flushed automatically if it is closed in the proper way. write() & close() are a pair, fwrite() & fclose() are a pair, and they should not be mixed.
Actually, in the code I could have used the duplicated fd_dup directly with write() & close(), with no need to create a new FILE at all.
So, the code could simply be:
// unistd.h, dup() test
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#define BUF_SIZE 1024 // 1 kb
extern void dup_test();
int main() {
dup_test();
}
// dup()test
void dup_test() {
// open a file
FILE *f = fopen("/tmp/a.txt", "w+");
int fd = fileno(f);
printf("original file descriptor:\t%d\n",fd);
// duplicate file descriptor of an opened file,
int fd_dup = dup(fd);
printf("duplicated file descriptor:\t%d\n",fd_dup);
// write to file, use the duplicated file descriptor,
write(fd_dup, "hello\n", 6); // 6 = length of "hello\n"; passing BUF_SIZE would read past the string
// close duplicated file descriptor,
close(fd_dup);
// allocate memory
char *buf = malloc(BUF_SIZE);
// move to beginning of file,
rewind(f);
// read from file, use the original file descriptor,
fgets(buf, BUF_SIZE, f);
printf("%s", buf);
// close original file descriptor,
fclose(f);
// free memory
free(buf);
}
From dup man pages:
After a successful return from one of these system calls, the old and new file descriptors may be used interchangeably. They refer to the same open file description (see open(2)) and thus share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the descriptors, the offset is also changed for the other.
It means the seek pointer is changed when you write to the duplicated file descriptor, so reading from the first file descriptor right after writing to the duplicate shouldn't read any data unless you seek back first.
Note that fdopen does not give the duplicate its own file offset: it creates a second FILE stream with its own user-space buffer on top of fd_dup, and the descriptor still shares the offset with the original. fputs only puts the data in that buffer; it reaches the file when the stream is flushed with fflush or fclose.
That is why you can't read the data if you don't flush the second stream: close(fd_dup) closes only the descriptor and knows nothing about the stdio buffer, so the buffered data is never written.
After all, if you need an I/O buffer, you might be using the wrong mechanism; check named pipes and other OS buffering mechanisms.
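The shared offset described in the man page excerpt can be checked with a few lines of C. This is only a sketch; the function name offset_after_dup_write is made up for the example.

```c
#include <fcntl.h>
#include <unistd.h>

/* Write 6 bytes through a dup()ed descriptor, then report the file
   offset as seen through the ORIGINAL descriptor. Returns -1 on error. */
long offset_after_dup_write(const char *path)
{
    long result = -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int fd_dup = dup(fd);
    if (fd_dup < 0) {
        close(fd);
        return -1;
    }

    if (write(fd_dup, "hello\n", 6) == 6)
        result = (long)lseek(fd, 0, SEEK_CUR);  /* offset via the original */

    close(fd_dup);      /* the original descriptor remains usable */
    close(fd);
    return result;
}
```

Because both descriptors refer to one open file description, the write through fd_dup moves the offset that fd reports, which is why the original program needs rewind() before reading.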
I cannot really understand your problem. I tested it under Microsoft VC2008 (had to replace unistd.h with io.h) and gcc 4.2.1.
I commented out fflush(f_dup) because it is of no use just before fclose, and close(fd_dup); because the file descriptor had already been closed by fclose, so the piece of code now looks like:
// write to file, use the duplicated file descriptor,
fputs("hello\n", f_dup);
// fflush(f_dup);
// close duplicated file descriptor,
fclose(f_dup);
// close(fd_dup);
And it works correctly. I get on both systems :
original file descriptor: 3
duplicated file descriptor: 4
hello

reading from a file descriptor in C

(Correct me if I'm wrong on my terms.) I need to read from a file descriptor, but read() takes an int for the number of bytes to read, OR I can use O_NONBLOCK, but I still have to set up a buffer of unknown size, which makes it difficult. Here's what I have so far.
This is my method that handles all the polling and mkfifo; N is already defined in main.
struct pollfd pfd[N];
int i;
for(i = 0; i < N; i++)
{
char fileName[32];
snprintf (fileName, sizeof(fileName), "%d_%di", pid, i);
mkfifo(fileName, 0666);
pfd[i].fd = open(fileName, O_RDONLY | O_NDELAY);
pfd[i].events = POLLIN;
pfd[i].revents = 0;
snprintf (fileName, sizeof(fileName), "%d_%do", pid, i);
mkfifo(fileName, 0666);
i++;
pfd[i].fd = open(fileName, O_WRONLY | O_NDELAY);
pfd[i].events = POLLOUT;
pfd[i].revents = 0;
i--;
}
while(1)
{
int len, n;
n = poll(pfd, N, 2000);
if( n < 0 )
{
printf("ERROR on poll");
continue;
}
if(n == 0)
{
printf("waiting....\n");
continue;
}
for(i = 0; i < N; i++)
{
char buff[1024]; <---i dont want to do this
if (pfd[i].revents & POLLIN)
{
printf("Processing input....\n");
read(pfd[i].fd, buff, O_NONBLOCK);
readBattlefield(buff);
print_battleField_stats();
pfd[i].fd = 0;
}
}
}
I also read somewhere that once read() has read all the incoming data, it empties the pipe, meaning I can use the same pipe again for more incoming data. But it doesn't empty the pipe, because I can't use the same pipe again. I asked my professor, but all he said was to use something like scanf. But how do I use scanf if it takes a FILE stream and the fd in the pollfd is an int? Essentially, my ultimate question is: how do I read the incoming data through the file descriptor using scanf or something similar? Using scanf would make handling the data easier.
EDIT:
In another terminal I run cat file > (named_file)
and my main program reads the input data. Here's what the input data looks like:
3 3
1 2 0
0 2 0
3 0 0
The first 2 numbers are the grid information and the player number, and after that comes the grid. This is a simplified version, though; I'll be dealing with hundreds of players and grids in the thousands.
char buff[1024]; <---i dont want to do this
What would you like to do then? This is how it works. This is not how it works:
read(pfd[i].fd, buff, O_NONBLOCK);
This will compile because O_NONBLOCK is an integer #define, but it is absolutely and unequivocally incorrect. The third argument to read() is the number of bytes to read. Not a flag. Period. It may be zero, but what you've done here is pass an arbitrary number -- whatever the value of O_NONBLOCK is, which could easily be more than 1024, the size of your buffer. This does not make the read non-blocking. recv() is similar to read() and does take such flags as a fourth argument, but it only works on sockets, so you can't use it with a FIFO's file descriptor. If you want to set non-blocking mode on a file descriptor, you must do it with open() or fcntl().
how do I read the incoming data through the file descriptor using scanf or something similar?
You can create a FILE* stream from an open descriptor with fdopen().
I also read somewhere that once read() has read all the incoming data, it empties the pipe, meaning I can use the same pipe again for more incoming data. But it doesn't empty the pipe, because I can't use the same pipe again.
Once you reach EOF (because the writer closed the connection), read() will return 0, and continue to return 0 immediately until someone opens the pipe again.
If you set the descriptor non-blocking, read() will always return immediately; if there is someone connected and nothing to read, it will return -1 with errno set to EAGAIN. See man 2 read.
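The three outcomes described above (data, EOF, nothing available yet) can be told apart as in the sketch below. The helper names set_nonblocking and try_read, and the return codes -1 and -2, are conventions invented for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Turn on O_NONBLOCK for an already-open descriptor.
   Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Perform one read() and classify the result:
   > 0  number of bytes read
   0    end of file (every writer has closed)
   -1   no data available right now (EAGAIN / EWOULDBLOCK)
   -2   a real error */
long try_read(int fd, char *buf, size_t size)
{
    ssize_t n = read(fd, buf, size);    /* third argument is a byte count */
    if (n >= 0)
        return (long)n;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return -1;
    return -2;
}
```

In the polling loop, a return of -1 means "try again after the next poll()", while 0 means the writer has gone away.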
man fifo is definitely something you should read; if there's anything you aren't sure about, ask a specific question based on that.
And don't forget: Fix that read() call. It's wrong. W R O N G. Your prof/TA/whoever will not miss that.

Read only buffered data from FILE object

I'd like to read only what is already in the buffer of a FILE object, so that afterwards the buffer is empty (and I can use things like sendfile which operates on file descriptors). I came up with this function, which seems to work on my 64bit Linux installation:
int readbuf(FILE *stream, char buf[], size_t *size) {
off_t pos = ftello(stream);
if (pos < 0) return -1;
off_t realpos = lseek(fileno(stream), 0, SEEK_CUR);
if (realpos < 0) return -1;
if (pos > realpos) {
errno = EIO;
return -1;
}
size_t bufsize = realpos - pos;
if (bufsize > *size) {
*size = bufsize;
errno = ERANGE;
return -1;
}
*size = bufsize;
if (fread(buf, bufsize, 1, stream) < 1) {
return -1;
}
return 0;
}
Now I wonder, can I assume this to work on other POSIX compliant operating systems? (On systems that provide all the involved functions.)
If the underlying file descriptor is seekable (either a regular file or a block device, unless you have other weird seekable objects on your system...) then there's no point in what you're trying to do. Just use ftello to get the logical position in the FILE, then discard the FILE and use sendfile. Using the already-buffered data in userspace is actually slower than sendfile anyway.
If the underlying file descriptor is not seekable, your whole approach does not work, because lseek will always return -1 and ftello will return -1 as well. A potential solution in this case:
1. Use dup to make a new file descriptor referring to the same open file description.
2. Open /dev/null write-only, and dup2 it on top of the old file descriptor number used by the FILE.
3. Reading from the FILE will succeed until the buffer is exhausted, then give read errors, since the file descriptor now refers to a non-readable file.
4. At this point, you're free to read directly from the duplicated fd made in the first step. You're also free to fclose the FILE.
For seekable files on Unix platforms you're supposed to be able to use fflush() to coordinate fd-based use with FILE*-based use, including for reading. The full details are given in http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_01 and http://pubs.opengroup.org/onlinepubs/9699919799/functions/fflush.html.
This is an extension over what standard C gives you (unsurprisingly).
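For a seekable file, the fflush() handoff could be tried with a sketch like this one. The function name fd_offset_after_fflush is hypothetical. It assumes a C library that implements the POSIX behaviour of fflush() on input streams.

```c
#include <stdio.h>
#include <unistd.h>

/* Consume `consumed` bytes through stdio, call fflush(), then report
   the offset of the underlying descriptor. Returns -1 on error. */
long fd_offset_after_fflush(const char *path, int consumed)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    for (int i = 0; i < consumed; ++i) {
        if (getc(fp) == EOF) {
            fclose(fp);
            return -1;
        }
    }

    /* stdio has usually read ahead, so the descriptor's offset is
       further along than `consumed` at this point */
    if (fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }

    /* under POSIX rules the descriptor offset now matches the
       stream position, so fd-based calls can continue from here */
    long off = (long)lseek(fileno(fp), 0, SEEK_CUR);
    fclose(fp);
    return off;
}
```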
I do not believe the stdio API guarantees that this would work on any system. For instance, it might perform readahead if it notices the buffer is empty.
Your "solution" would be at most a specific implementation hack.

Resources