C's write() throughput inconsistency for larger buffers

I wrote a small program to measure the system time for C's write() call. I keep appending a buffer to a file until it reaches a certain file size. But for two different buffer sizes, I get drastically different numbers.
Here is a snippet of case #1:
char *buffer1 = malloc(4 * 1024);
for (int i = 0; i < 1048576; i++) {
    int w = write(outputfile, buffer1, 4 * 1024);
}
Here is a snippet of case #2:
char *buffer2 = malloc(1024 * 1024);
for (int i = 0; i < 4096; i++) {
    int w = write(outputfile, buffer2, 1024 * 1024);
}
You can see that in both cases the program writes 1048576 x 4 kB = 4096 x 1024 kB = 4096 MB of data to a file.
On my machine (8 GB DDR3 RAM, Core i7, 240 GB SSD), case #1 takes 14.96 seconds of sys time to finish, giving a throughput of approximately 274 MB/s.
Case #2 takes 0.9 seconds of sys time to finish, giving a throughput of approximately 4551 MB/s.
Intermediate runs I did with other buffer sizes also produce widely varying numbers.
I know a larger buffer size means fewer calls to write(). But shouldn't each call then take proportionally longer, so that the overall time to write the file comes out the same regardless of buffer size? Why does the throughput vary so much with the buffer size?
Here is the program:
https://drive.google.com/file/d/1Bj_CnO8DqFrOO3WwbsZbzFYjHLijTW7A/view?usp=sharing

Related

How to correctly time speed of writing to a disk (i.e. file) in C

I have a Zynq SoC which runs a Linux system and would like to time the speed at which I can write to its SD card in C.
I have tried out clock() and clock_gettime(), where for the latter I have done the following:
#define BILLION 1000000000.0
#define BUFFER_SIZE 2097152
...
for (int i = 0; i < 100; i++) {
    memcpy(buf, channel->dma + offset, BUFFER_SIZE);

    /* Open file for output */
    FILE *f = fopen(path, "w");

    /* Start timer */
    clock_gettime(CLOCK_MONOTONIC, &start);

    /* Write data from CPU RAM to SD card */
    fwrite(buf, 1, BUFFER_SIZE, f);
    fclose(f);

    /* Stop timer and report result */
    clock_gettime(CLOCK_MONOTONIC, &end);
    elapsed_time = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / BILLION;
    printf("%lf seconds\n", elapsed_time);
}
The times I'm getting are around 0.019 seconds for 2097152 bytes, which I believe is wrong: the write speed of the SD card is quoted as 10 MB/s, so it can't possibly be the 110 MB/s I appear to be getting. When I instead put the timing code outside the loop and time how long it takes to write the whole 100*BUFFER_SIZE, I get a more reasonable 18 MB/s.
My question is: is it incorrect to try to time a single fwrite operation? Does clock_gettime() not give me adequate accuracy/precision for that? I would like to have the most accurate value possible for how quickly I can write to the disk, because it is a very important parameter in the design of the system, so this discrepancy is rather disconcerting.
The Linux kernel often caches read/write access to disk storage. That is, it returns from the write call and does the actual writing in the background. It does this transparently, so that if you read the same file just after writing it, you get the data you wrote, even if it has not yet been completely transferred to disk.
To force a complete write you can call the fsync function. It blocks until all I/O for a specific file has completed. In your case a call to
#include <unistd.h>
...
fwrite(...);
fsync(fileno(f));
should suffice.
Edit
As @pmg mentioned in the comments, there is also buffering at the stream level. Although that buffer is probably not very large, you can force it to be written out with a call to fflush() before the fsync.
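Putting that together with the question's code, the timed region should look something like this (a sketch reusing the question's variables):

clock_gettime(CLOCK_MONOTONIC, &start);

fwrite(buf, 1, BUFFER_SIZE, f);
fflush(f);          /* push the stdio stream buffer into the kernel */
fsync(fileno(f));   /* block until the kernel has actually written to the card */

clock_gettime(CLOCK_MONOTONIC, &end);
elapsed_time = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / BILLION;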

How to read numbers from a file and measure the time it takes?

I have a text file called 10.txt that contains 10 large numbers. I want to read those numbers and store them in an array of size 10. Then I want to print out these numbers and the time it takes to read the file.
Here is my attempt:
#include <stdio.h>
#include <math.h>
#include <time.h>

int main(int argc, const char *argv[]) {
    int i;
    clock_t start, end;
    double cpu_time_used;
    double TimeOfReadingFile;
    FILE *myFile;
    int readarr10[10];

    start = clock();
    myFile = fopen("10.txt", "r");
    if (myFile == NULL) {   /* a failed open would leave the array full of garbage */
        perror("fopen");
        return 1;
    }
    printf("SIZE 10\n");
    printf("----------------\n");
    for (i = 0; i < 10; i++)
        fscanf(myFile, "%d", &readarr10[i]);
    end = clock();

    for (i = 0; i < 10; i++) {
        printf("%d ", readarr10[i]);
        printf("\n");
    }
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    TimeOfReadingFile = cpu_time_used * pow(10, 9);   /* convert seconds to nanoseconds */
    printf("time for reading the 10 file: %f\n", TimeOfReadingFile);
    return 0;
}
The problem is that when I run it I get zeros for all the numbers and a zero for the time.
This code is part of a project I am working on and I cannot continue until I get this right!
Reading 10 numbers can take less time than the resolution of clock(). On my BSD system running in a virtual machine, I had to loop more than 1000 times to begin to see anything. With a loop of 1,000,000 iterations (read 10 integers, rewind the file) I get more or less 2.5 microseconds for the operation. If I include the fopen/fclose in the loop, I end up at 24 microseconds.
I think (but I must acknowledge that I am not sure of it) that a rewind on a small FILE* only resets the file pointer and does not read any sector from the disk again. On the other hand, reading 10 integers (a single disk operation at most) should be quick compared with opening and closing a file.
So my conclusion is (on my system):
file access: more or less 20 microseconds
decoding 10 integers: more or less 2 microseconds
I have a not-too-fast disk and the overhead of running a VirtualBox machine, so on high-speed computers the times could be even smaller. So on one single iteration it is no surprise that end == start.
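To make that concrete, here is a sketch of the measuring loop (assuming myFile and readarr10 from the question's code, with the file already opened successfully):

#define REPS 1000000   /* enough iterations to rise well above clock()'s resolution */

clock_t t0 = clock();
for (long r = 0; r < REPS; r++) {
    rewind(myFile);                      /* back to the start of the file */
    for (int i = 0; i < 10; i++)
        fscanf(myFile, "%d", &readarr10[i]);
}
clock_t t1 = clock();

double per_pass = (double)(t1 - t0) / CLOCKS_PER_SEC / REPS;
printf("about %g seconds per 10-number read\n", per_pass);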

Fast I/O in C, stdin/out

In a coding competition specified at this link there is a task where you need to read a lot of data from stdin, do some calculations, and write a lot of data to stdout.
In my benchmarking it is almost exclusively the I/O that takes time, although I have tried to optimize it as much as possible.
The input is a string (1 <= len <= 100,000) and q rows of pairs of ints, where q is also 1 <= q <= 100,000.
I benchmarked my code on a 100-times-larger dataset (len = 10M, q = 10M), and this is the result:
Activity        time    accumulated
Read text:      0.004   0.004
Read numbers:   0.146   0.150
Parse numbers:  0.200   0.350
Calc answers:   0.001   0.351
Format output:  0.037   0.388
Print output:   0.143   0.531
By implementing my own formatting and number parsing inline, I managed to get the time down to a third of what it was with printf and scanf.
However, when I uploaded my solution to the competition's webpage, it took 1.88 seconds (I think that is the total time over 22 datasets). When I look at the high score, there are several implementations (in C++) that finished in 0.05 seconds, nearly 40 times faster than mine! How is that possible?
I guess I could speed it up a bit by using two threads: then I can start calculating and writing to stdout while still reading from stdin. In the theoretical best case that saves min(0.150, 0.143) seconds on my large dataset, so I'm still nowhere close to the high score.
The program gets compiled by the website with these options:
gcc -g -O2 -std=gnu99 -static my_file.c -lm
and timed like this:
time ./a.out < sample.in > sample.out
My code looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN (100000 + 1)
#define ROW_LEN (6 + 1)
#define DOUBLE_ROW_LEN (2 * ROW_LEN)

int main(int argc, char *argv[])
{
    int ret = 1;

    // Set custom buffers for stdin and out
    char stdout_buf[16384];
    setvbuf(stdout, stdout_buf, _IOFBF, 16384);
    char stdin_buf[16384];
    setvbuf(stdin, stdin_buf, _IOFBF, 16384);

    // Read stdin to buffer
    char *buf = malloc(MAX_LEN);
    if (!buf) {
        printf("Failed to allocate buffer");
        return 1;
    }
    if (!fgets(buf, MAX_LEN, stdin))
        goto EXIT_A;

    // Get the num tests
    int m;
    scanf("%d\n", &m);

    char *num_buf = malloc(DOUBLE_ROW_LEN);
    if (!num_buf) {
        printf("Failed to allocate num_buffer");
        goto EXIT_A;
    }

    int *nn;
    int *start = calloc(m, sizeof(int));
    int *stop = calloc(m, sizeof(int));
    int *staptr = start;
    int *stpptr = stop;
    char *cptr;
    for (int i = 0; i < m; i++) {
        fgets(num_buf, DOUBLE_ROW_LEN, stdin);
        nn = staptr++;
        cptr = num_buf - 1;
        while (*(++cptr) > '\n') {
            if (*cptr == ' ')
                nn = stpptr++;
            else
                *nn = *nn * 10 + *cptr - '0';
        }
    }

    // Count for each test
    char *buf_end = strchr(buf, '\0');
    int len, shift;
    char outbuf[ROW_LEN];
    char *ptr_l, *ptr_r, *out;
    for (int i = 0; i < m; i++) {
        ptr_l = buf + start[i];
        ptr_r = buf + stop[i];
        while (ptr_r < buf_end && *ptr_l == *ptr_r) {
            ++ptr_l;
            ++ptr_r;
        }

        // Print length of same sequence
        shift = len = (int)(ptr_l - (buf + start[i]));
        out = outbuf;
        do {
            out++;
            shift /= 10;
        } while (shift);
        *out = '\0';
        do {
            *(--out) = "0123456789"[len % 10];
            len /= 10;
        } while (len);
        puts(outbuf);
    }

    ret = 0;
    free(start);
    free(stop);
EXIT_A:
    free(buf);
    return ret;
}
Thanks to your question, I went and solved the problem myself. Your time is better than mine, but I'm still using some stdio functions.
I simply do not think the high score of 0.05 seconds is bona fide. I suspect it's the product of a highly automated system that returned that result in error, and that no one ever verified it.
How to defend that assertion? There's no real algorithmic complexity: the problem is O(n). The "trick" is to write specialized parsers for each aspect of the input (and to avoid work done only in debug mode). A total time of 50 milliseconds over 22 trials means each trial averages about 2.3 ms. We're down near the threshold of measurability.
Competitions like the problem you addressed yourself to are unfortunate, in a way. They reinforce the naive idea that performance is the ultimate measure of a program (there's no score for clarity). Worse, they encourage going around things like scanf "for performance" while, in real life, getting a program to run correctly and fast basically never entails avoiding or even tuning stdio. In a complex system, performance comes from things like avoiding I/O, passing over the data only once, and minimizing copies. Using the DBMS effectively is often key (as it were), but such things never show up in programming challenges.
Parsing and formatting numbers as text does take time, and in rare circumstances can be a bottleneck. But the answer is hardly ever to rewrite the parser. Rather, the answer is to parse the text into a convenient binary form, and use that. In short: compilation.
That said, a few observations may help.
You don't need dynamic memory for this problem, and it's not helping. The problem statement says the input array may be up to 100,000 elements, and the number of trials may be as many as 100,000. Each trial is two integer strings of up to 6 digits each, separated by a space and terminated by a newline: 6 + 1 + 6 + 1 = 14 bytes. The maximum total input is therefore 100,000 + 1 + 6 + 1 + 100,000 * 14 bytes: under 2 MB. You are allowed 1 GB of memory.
I just allocated a single 2 MB buffer and read it all in at once with read(2). Then I made a single pass over that input.
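A minimal sketch of that approach (the buffer size follows from the arithmetic above; the names are mine, not from the contest):

#include <unistd.h>

#define MAX_INPUT (2 * 1024 * 1024)   /* generous upper bound per the problem statement */
static char input[MAX_INPUT];         /* statically allocated, nothing to free */

/* slurp all of stdin into one buffer; returns the number of bytes read */
static size_t read_all(void)
{
    size_t total = 0;
    ssize_t n;
    while ((n = read(0, input + total, sizeof input - total)) > 0)
        total += (size_t)n;
    return total;
}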
You got suggestions to use asynchronous I/O and threads. The problem statement says you're measured on CPU time, so neither of those help. The shortest distance between two points is a straight line; a single read into statically allocated memory wastes no motion.
One ridiculous aspect of the way they measure performance is that they compile with gcc -g and without -DNDEBUG. That means assert(3) stays active in code that is measured for performance! I couldn't get under 4 seconds on test 22 until I removed my asserts.
In sum, you did pretty well, and I suspect the winner you're baffled by is a phantom. Your code does faff about a bit; you can dispense with the dynamic memory and the stdio tuning. I bet your time can be trimmed by simplifying it. To the extent that performance matters, that's where I'd direct your attention.
You should allocate all your buffers contiguously.
Allocate one buffer the size of all your buffers together (num_buf, start, stop), then point the individual pointers at the corresponding offsets within it.
This can reduce your cache misses and page faults.
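Here is a hedged sketch of that layout, reusing the names from the question (the int arrays come first so their alignment is preserved; calloc zeroes the block just as the original calloc calls did):

void *block   = calloc(1, 2 * (size_t)m * sizeof(int) + DOUBLE_ROW_LEN);
int  *start   = block;               /* m ints */
int  *stop    = start + m;           /* m more ints */
char *num_buf = (char *)(stop + m);  /* DOUBLE_ROW_LEN chars at the end */
/* ... use as before; one free(block) releases everything ... */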
Since the read and write operations seem to consume a lot of time, you should consider adding threads. One thread should deal with I/O and another with the computation. (It is worth checking whether a separate thread for printing speeds things up as well.) Make sure you don't use any locks while doing this.
Answering this question is tricky because optimization heavily depends on the problem you have.
One idea is to look at the content of the file you are trying to read and see whether there are patterns or properties you can use in your favor.
The code you wrote is a "general" solution for reading from a file, executing something, and then writing to a file. But if the file is not randomly generated each time and the content is always the same, why not write a solution tailored to that file?
On the other hand, you could try to use low-level system functions. One that comes to mind is mmap, which allows you to map a file directly into memory and access that memory instead of using scanf and fgets; a sketch follows at the end of this answer.
Another thing I noticed is that your solution has two separate loops; why not try to use only one? You could also do some asynchronous I/O: instead of reading the whole file in one loop and then doing the calculation in another, read a portion at the beginning, start processing it asynchronously, and continue reading.
This link might help for the async part
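For the mmap suggestion above, a minimal sketch (this assumes stdin is redirected from a regular file, as it is when the grader runs ./a.out < sample.in):

#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct stat st;
fstat(0, &st);   /* fd 0 is stdin; st.st_size is the input file's length */
const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, 0, 0);
/* parse data[0 .. st.st_size - 1] in place, with no fgets/scanf copies */
munmap((void *)data, st.st_size);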

How to measure wallclock time of file I/O

I'm writing several benchmark programs in C to achieve the following tasks:
1. The speed with which one can read from a network disk. Print the seconds needed to read 8192 bytes.
2. The speed with which one can read from the local directory /tmp on your local machine. Print the seconds needed to read 8192 bytes.
3. The speed with which one can read from the disk page cache. Print the seconds needed to read 8192 bytes.
4. The speed with which one can write to a network disk. Print the seconds needed to write 8192 bytes.
5. The speed with which one can write to the local directory /tmp on your local machine. Print the seconds needed to write 8192 bytes.
The goal here is to measure just the time that the file read or write itself takes (using read and write to avoid any buffering from fread).
My general approach for 1 and 2 is to create a file of 8192 bytes, write it to disk (whether to the local directory or to the network disk), and then call sleep(10) to wait for the page cache to flush, so that I'm measuring the time of the actual I/O, not the cache I/O. Then I measure the time it takes to run an empty for loop several thousand times, measure the time it takes when the loop also reads 8192 bytes, subtract the two, and divide by the number of iterations. My code for that looks like:
struct timespec emptyLoop1, emptyLoop2;
clock_gettime(CLOCK_REALTIME, &emptyLoop1);
for (i = 0, j = 0; i < ITERATIONS; i++) {
    j += i * i;
}
clock_gettime(CLOCK_REALTIME, &emptyLoop2);

char readbuf[NUM_BYTES];
struct timespec beforeRead, afterRead;
clock_gettime(CLOCK_REALTIME, &beforeRead);
for (i = 0, j = 0; i < ITERATIONS; i++) {
    j += i * i;
    read(fd, readbuf, NUM_BYTES);
}
Is that sufficient for accurately measuring the times of reading from those locations?
Next, I'm confused about how to read from the page cache. Where does it live, and how do I access it? Finally, there are apparently some tricks to 4 and 5 that make them much harder than they seem, but I'm not sure what I'm missing.
Following is my file reading function, which enables a choice of using or not using the memory-based cache. If files are written first, similar open statements are needed. Note that direct I/O cannot be used over a LAN and caching can be unpredictable. More details, source code, and execution files can be found at http://www.roylongbottom.org.uk/linux_disk_usb_lan_benchmarks.htm.
int readFile(int use, int dsize)
{
    int p;

    if (useCache)
    {
        handle = open(testFile, O_RDONLY);
    }
    else
    {
        handle = open(testFile, O_RDONLY | O_DIRECT);
    }
    if (handle == -1)
    {
        printf(" Cannot open data file for reading\n\n");
        fprintf(outfile, " Cannot open data file for reading\n\n");
        fclose(outfile);
        printf(" Press Enter\n");
        g = getchar();
        return 0;
    }
    for (p = 0; p < use; p++)
    {
        if (read(handle, dataIn, dsize) == -1)
        {
            printf(" Error reading file\n\n");
            fprintf(outfile, " Error reading file\n\n");
            fclose(outfile);
            close(handle);
            printf(" Press Enter\n");
            g = getchar();
            return 0;
        }
    }
    close(handle);
    return 1;
}
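One caveat worth adding: on Linux, O_DIRECT generally requires the user buffer (dataIn above) to be suitably aligned, typically to the logical block size. A sketch of allocating such a buffer (4096 is an assumption; query the device for the real value):

#include <stdlib.h>

void *dataIn = NULL;
if (posix_memalign(&dataIn, 4096, dsize) != 0)
{
    /* allocation failed; fall back to cached I/O or abort */
}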

Fastest file reading in C

Right now I am using fread() to read a file, but I've been told that in other languages fread() is inefficient. Is this the same in C? If so, how would faster file reading be done?
It really shouldn't matter.
If you're reading from an actual hard disk, it's going to be slow. The hard disk is your bottleneck, and that's it.
Now, if you're being silly about your calls to read/fread/whatever and, say, fread()-ing a byte at a time, then yes, it's going to be slow, as the per-call overhead will outstrip the cost of reading from the disk.
If you call read/fread/whatever and request a decent portion of data, it depends on what you're doing: sometimes all you want/need is 4 bytes (to get a uint32), but sometimes you can read in large chunks (4 KiB, 64 KiB, etc. RAM is cheap, go for something significant).
If you're doing small reads, the higher-level calls like fread() will actually help you by buffering data behind your back. If you're doing large reads, that buffering isn't much help, but switching from fread to read probably won't yield much improvement either, as you're bottlenecked on disk speed.
In short: if you can, request a liberal amount when reading, and try to minimize what you write. For large amounts, powers of 2 tend to be friendlier than anything else, but of course it's OS, hardware, and weather dependent.
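For illustration, a chunked read loop along these lines (CHUNK and the consumer are placeholders, not from the original post):

#define CHUNK (64 * 1024)   /* a "decent portion": 64 KiB per syscall */

char chunk[CHUNK];
ssize_t n;
while ((n = read(fd, chunk, sizeof chunk)) > 0) {
    consume(chunk, (size_t)n);   /* hypothetical consumer of the bytes read */
}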
So, let's see if this might bring out any differences:
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BUFFER_SIZE (1 * 1024 * 1024)
#define ITERATIONS (10 * 1024)

double now()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.;
}

int main()
{
    unsigned char buffer[BUFFER_SIZE]; // 1 MiB buffer
    double end_time;
    double total_time;
    int i, x, y = 0;

    double start_time = now();

#ifdef USE_FREAD
    FILE *fp;
    fp = fopen("/dev/zero", "rb");
    for (i = 0; i < ITERATIONS; ++i)
    {
        fread(buffer, BUFFER_SIZE, 1, fp);
        for (x = 0; x < BUFFER_SIZE; x += 1024)
        {
            y += buffer[x];
        }
    }
    fclose(fp);
#elif USE_MMAP
    unsigned char *mmdata;
    int fd = open("/dev/zero", O_RDONLY);
    for (i = 0; i < ITERATIONS; ++i)
    {
        // off_t cast keeps the offset from overflowing int past 2 GiB
        mmdata = mmap(NULL, BUFFER_SIZE, PROT_READ, MAP_PRIVATE, fd, (off_t)i * BUFFER_SIZE);
        // But if we don't touch it, it won't be read...
        // I happen to know I have 4 KiB pages, YMMV
        for (x = 0; x < BUFFER_SIZE; x += 1024)
        {
            y += mmdata[x];
        }
        munmap(mmdata, BUFFER_SIZE);
    }
    close(fd);
#else
    int fd;
    fd = open("/dev/zero", O_RDONLY);
    for (i = 0; i < ITERATIONS; ++i)
    {
        read(fd, buffer, BUFFER_SIZE);
        for (x = 0; x < BUFFER_SIZE; x += 1024)
        {
            y += buffer[x];
        }
    }
    close(fd);
#endif

    end_time = now();
    total_time = end_time - start_time;
    printf("It took %f seconds to read 10 GiB. That's %f MiB/s.\n", total_time, ITERATIONS / total_time);
    return 0;
}
...yields:
$ gcc -o reading reading.c
$ ./reading ; ./reading ; ./reading
It took 1.141995 seconds to read 10 GiB. That's 8966.764671 MiB/s.
It took 1.131412 seconds to read 10 GiB. That's 9050.637376 MiB/s.
It took 1.132440 seconds to read 10 GiB. That's 9042.420953 MiB/s.
$ gcc -o reading reading.c -DUSE_FREAD
$ ./reading ; ./reading ; ./reading
It took 1.134837 seconds to read 10 GiB. That's 9023.322991 MiB/s.
It took 1.128971 seconds to read 10 GiB. That's 9070.207522 MiB/s.
It took 1.136845 seconds to read 10 GiB. That's 9007.383586 MiB/s.
$ gcc -o reading reading.c -DUSE_MMAP
$ ./reading ; ./reading ; ./reading
It took 2.037207 seconds to read 10 GiB. That's 5026.489386 MiB/s.
It took 2.037060 seconds to read 10 GiB. That's 5026.852369 MiB/s.
It took 2.031698 seconds to read 10 GiB. That's 5040.119180 MiB/s.
...or no noticeable difference. (fread wins sometimes, sometimes read.)
Note: the slow mmap is surprising. This might be due to my asking it to choose the mapping address for me. (I wasn't sure about the requirements for supplying a pointer...)
In really short: don't optimize prematurely. Make it run, make it right, make it fast, in that order.
Back by popular demand, I ran the test on a real file (the first 675 MiB of the Ubuntu 10.04 32-bit desktop installation CD ISO). These were the results:
# Using fread()
It took 31.363983 seconds to read 675 MiB. That's 21.521501 MiB/s.
It took 31.486195 seconds to read 675 MiB. That's 21.437967 MiB/s.
It took 31.509051 seconds to read 675 MiB. That's 21.422416 MiB/s.
It took 31.853389 seconds to read 675 MiB. That's 21.190838 MiB/s.
# Using read()
It took 33.052984 seconds to read 675 MiB. That's 20.421757 MiB/s.
It took 31.319416 seconds to read 675 MiB. That's 21.552126 MiB/s.
It took 39.453453 seconds to read 675 MiB. That's 17.108769 MiB/s.
It took 32.619912 seconds to read 675 MiB. That's 20.692882 MiB/s.
# Using mmap()
It took 31.897643 seconds to read 675 MiB. That's 21.161438 MiB/s.
It took 36.753138 seconds to read 675 MiB. That's 18.365779 MiB/s.
It took 36.175385 seconds to read 675 MiB. That's 18.659097 MiB/s.
It took 31.841998 seconds to read 675 MiB. That's 21.198419 MiB/s.
...and one very bored programmer later, we've read the CD ISO off disk. Twelve times. Before each test, the disk cache was cleared, and during each test there was enough, and approximately the same amount of, RAM free to hold the CD ISO twice in RAM.
One note of interest: I was originally using a large malloc() to fill memory and thus minimize the effects of disk caching. It may be worth noting that mmap performed terribly there. The other two solutions merely ran; mmap ran and, for reasons I can't explain, began pushing memory to swap, which killed its performance. (The program was not leaking, as far as I know (the source code is above); the actual "used memory" stayed constant throughout the trials.)
read() posted the fastest time overall and fread() posted really consistent times. That may just have been some small hiccup during the testing, however. All told, the three methods were just about equal (especially fread and read...).
If you are willing to go beyond the C spec into OS-specific code, memory mapping is generally considered the most efficient way.
For POSIX, check out mmap; for Windows, check out OpenFileMapping.
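A minimal POSIX sketch of that approach (file name and error handling deliberately bare):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int fd = open("file1", O_RDONLY);
struct stat st;
fstat(fd, &st);

/* the whole file becomes addressable memory; the kernel pages it in on demand */
const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

/* ... scan data[0 .. st.st_size - 1] directly, no read() copies ... */

munmap((void *)data, st.st_size);
close(fd);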
What's slowing you down?
If you need the fastest possible file reading (while still playing nicely with the operating system), go straight to your OS's calls, and make sure you study how to use them most effectively.
How is your data physically laid out? For example, rotating drives might read data stored at the edges faster, and you want to minimize or eliminate seek times.
Is your data pre-processed? Do you need to do stuff between loading it from disk and using it?
What is the optimum chunk size for reading? (It might be some even multiple of the sector size. Check your OS documentation.)
If seek times are a problem, re-arrange your data on disk (if you can) and store it in larger, pre-processed files instead of loading small chunks from here and there.
If data transfer times are a problem, perhaps consider compressing the data.
I'm thinking of the read system call.
Keep in mind that fread is a wrapper around read.
On the other hand, fread has an internal buffer, so read may be faster for large requests, but I think fread will be more efficient when you make many small ones.
If fread is slow, it is because the additional layers it adds on top of the underlying operating system mechanism for reading a file interfere with how your particular program is using fread. In other words, it's slow because you aren't using it the way it has been optimized for.
Having said that, faster file reading would be done by understanding how the operating system I/O functions work and providing your own abstraction that handles your program's particular I/O access patterns better. Most of the time you can do this with memory mapping the file.
However, if you are hitting the limits of the machine you are running on, memory mapping probably won't be sufficient. At that point it's really up to you to figure out how to optimize your I/O code.
It's not the fastest but it's pretty good and short.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int f = open("file1", O_RDONLY);
    char buffer[4096];
    ssize_t n;
    while ((n = read(f, buffer, sizeof buffer)) > 0) {
        fwrite(buffer, 1, (size_t)n, stdout);  /* print exactly the bytes read; the buffer is not NUL-terminated */
    }
    close(f);
    return 0;
}
The problem, as some people have noted here, is that depending on your source, your target buffer size, and so on, you can create a custom handler for that specific case, but there are other cases, like block/character devices (i.e. /dev/*), where standard rules like that may or may not apply, and where your backing source might be something that pops characters off serially without any buffering, like an I2C bus or standard RS-232. And there are other character devices that are memory-mappable large sections of memory, as nvidia does with its video driver character device (/dev/nvidiactl).
Another design choice many people make in high-performance applications is asynchronous instead of synchronous I/O for handling how data is read. Look into libaio and the ported versions of libaio, which provide prepackaged solutions for asynchronous I/O, and also look into using read with shared memory between a worker and a consumer thread (but keep in mind that this will increase programming complexity if you go that route). Asynchronous I/O is also something you can't get out of the box with stdio but can get with standard OS system calls. Just be careful, as there are bits of read that are "portable" according to the spec, but not all operating systems (FreeBSD, for instance) support POSIX STREAMS (by choice).
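For the asynchronous route, here is a hedged sketch using the portable POSIX <aio.h> interface (libaio's io_submit API is Linux-specific and differs; the file name is illustrative):

#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buffer[64 * 1024];
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = open("file1", O_RDONLY);
    cb.aio_buf    = buffer;
    cb.aio_nbytes = sizeof buffer;
    cb.aio_offset = 0;

    aio_read(&cb);                   /* kicks off the read and returns immediately */
    /* ... overlap other work here ... */

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);      /* block until the read completes */
    ssize_t n = aio_return(&cb);     /* bytes actually read, or -1 on error */

    close(cb.aio_fildes);
    return n >= 0 ? 0 : 1;
}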
Another thing you can do (depending on how portable your data is) is look into compression and/or conversion into a binary format such as a database format (e.g. BDB, SQL). Some database formats are portable across machines, using endianness-conversion functions.
In general, it would be best to take a set of algorithms and methods, run performance tests using the different methods, and pick the algorithm that best serves the main task your application performs.
Maybe check out how Perl does it. Perl's I/O routines are optimized, and they are, I gather, the reason why processing text with a Perl filter can be twice as fast as doing the same transformation with sed.
Obviously Perl is pretty complex, and I/O is only one small part of what it does. I've never looked at its source, so I couldn't give you any better direction than to point you here.
