I need to parse a huge document, and one of the queries requires me to count the words in certain strings of the document. Those strings usually have between 2000 and 30000 words, and my program takes ~12 seconds just to parse it all. The query that takes the longest is, unsurprisingly, the one that requires word counting.
I tried using a fork and a pipe to accelerate the process.
How it works:
I take the string and divide it in two. If I happen to split a word in two (i.e. text[i] != ' ', etc.), the left side of the divided text keeps looking to the left until it encounters a space, and only counts words up to that space. The right side counts that half word as a full word and keeps counting until it reaches the end of the string. If I split between spaces, this adjustment just doesn't happen and the program proceeds to the next step.
Edit: the separator could be a space, a '\n', or a '\t'.
After that I fork and communicate between the two processes through a pipe. What goes through the pipe is the word count of one half of the text; it is then added to the word count of the other half and the total is returned.
The problem:
On a test code example, it doesn't seem to help at all. The execution time still seems to be the same as if I did it all in one go.
The big problem:
This function is meant to be run around 60000 times throughout the parsing, and my program takes too long to execute - in fact, I had to cancel it after 2 minutes...
Where do I need help?
I need help understanding exactly why my function is:
a) not even slightly faster with this supposedly dual-core implementation than with the single-core one, and
b) taking so long in the actual program.
I hope this isn't a fundamental limitation of C - that forks/pipes are just too slow for what I want - and that I'm simply missing something.
--
Here's the code!
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
long count(char* xStr) {
    long num = 0;
    // state:
    const char* iterar = (const char*) xStr;
    int in_palavra = 0; // "in word"
    do switch (*iterar) {
    case '\0':
    case ' ': case '\t': case '\n':
        if (in_palavra) { in_palavra = 0; num++; }
        break;
    default: in_palavra = 1;
    } while (*iterar++);
    return num;
}
long wordCounter(char* text) {
    int LHalf = strlen(text) / 2;
    int DHalf = LHalf;
    while (text[LHalf] != ' ' && text[LHalf] != '\n' && text[LHalf] != '\t') {
        if (LHalf > 0) {
            LHalf--;
        }
        else break;
    }
    int RLen = strlen(text) - DHalf; // right half is one byte longer when the length is odd
    char* lft = malloc(LHalf + 1);   // +1 for the terminator
    char* rgt = malloc(RLen + 1);
    memcpy(lft, text, LHalf);
    lft[LHalf] = '\0';
    memcpy(rgt, text + DHalf, RLen);
    rgt[RLen] = '\0';
    int fd[2];
    pid_t childpid;
    pipe(fd);
    long size_left;
    long size_right;
    if ((childpid = fork()) == -1) {
        perror("Error in fork");
        exit(1);
    }
    if (childpid == 0) {
        close(fd[0]);
        size_left = count(lft);
        int w = write(fd[1], &size_left, sizeof(long));
        close(fd[1]); // not strictly necessary
        exit(0);
    }
    else {
        close(fd[1]);
        int r = read(fd[0], &size_left, sizeof(long));
        size_right = count(rgt);
        close(fd[0]);
        wait(0);
    }
    long total = size_right + size_left;
    free(lft);
    free(rgt);
    return total;
}
int main(int argc, char const *argv[]) {
    long num = wordCounter("aaa aaa aa a a a a a a sa sa as sas sa sa saa sa sas aa sa sas sa sa"); // 23 words
    printf("%ld\n", num);
    return 0;
}
To follow up on my comment above:
If I/O is your bottleneck:
Consider passing the filename into your word counting program, then managing the disk I/O yourself with simple fread()/fwrite() calls that read the whole file in at once. From the sound of it, your files should fit into memory reasonably at only 300k words - maybe 3 MB files worst case? That should read into memory very quickly.
Then, do your word counting magic on the data. My guess is that you won't even need to worry about threads or the like as scanning through memory should be nearly instant for your task. Heck, I bet even using strtok() looking for spaces and punctuation may be good enough.
But if I am wrong, the good news is that this data can easily be divided into multiple parts, passed to individual pthreads to count, and then collected and added up when done.
If I/O is not the bottleneck, then the above exercise will show no gain at all - but at least it can be coded pretty quickly as a test case.
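For illustration, here's a minimal sketch of that approach (basic error handling only; doc.txt is just a placeholder filename): read the whole file at once, then count words in a single linear pass.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

int main(void) {
    FILE *f = fopen("doc.txt", "rb");   /* placeholder filename */
    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *buf = malloc(len + 1);
    if (!buf || fread(buf, 1, len, f) != (size_t)len) return 1;
    buf[len] = '\0';
    fclose(f);
    /* one linear pass, same idea as the count() FSM above */
    long words = 0;
    int in_word = 0;
    for (long i = 0; i <= len; i++) {
        if (buf[i] == '\0' || isspace((unsigned char)buf[i])) {
            if (in_word) { in_word = 0; words++; }
        } else {
            in_word = 1;
        }
    }
    printf("%ld words\n", words);
    free(buf);
    return 0;
}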
I wrote this piece of code to show the basic working of how I would like to send some data (strings) from the parent process to the child process. But I seem to have some problems. (I removed all error checking to make the code more readable.)
When I run this piece of code I expect to see the two test strings to be displayed on the terminal, but I only see the first one.
When I uncomment the first “sleep(1)”, then I see both strings displayed.
But when I uncomment only the second “sleep(1)”, then I again only see the first string.
I suspect this problem has something to do with synchronization: the strings get written too fast, and the FIFO's write end closes before everything is read by the child process. That's why we see the correct output when we introduce a sleep between the two write() commands.
But what I don't understand is that we still get faulty output when we only introduce a sleep after both write commands. Why can't the child read both strings even if they are both written before it can read one?
How can I solve this problem? Do I need some synchronization code, and if so, how should I implement it? Because I don't want to write a sleep(1) after every write command.
And is the solution also viable for multiple processes writing to the same FIFO (but with still only one process reading from it)?
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
int main(int argc, char const *argv[]) {
    mkfifo("test_fifo", 0666);
    pid_t pid = fork();
    if (pid == 0) {
        int fd = open("test_fifo", O_RDONLY);
        char data[400];
        int rc;
        do {
            rc = read(fd, data, 400);
            if (rc > 0) printf("Received data: %s\n", data);
        } while (rc > 0);
    }
    else {
        int fd = open("test_fifo", O_WRONLY);
        char * string1 = "This is the first test string";
        write(fd, string1, strlen(string1) + 1);
        //sleep(1);
        char * string2 = "This is the second test string";
        write(fd, string2, strlen(string2) + 1);
        //sleep(1);
        close(fd);
        wait(NULL);
    }
    return 0;
}
You are receiving both strings at the same time, in the first call to read. Because %s prints only up to the first zero byte, the second string is simply not displayed. The poor man's synchronization with sleep(1) lets the child "catch" the messages in two distinct read calls.
read returns the count of bytes read. Use that number. Change the reader (the child's code) to:
ssize_t rc;
do {
    rc = read(fd, data, 400);
    if (rc > 0) {
        printf("Received data: ");
        for (ssize_t i = 0; i < rc; ++i) {
            if (data[i] == '\0') {
                printf("\\x00"); /* make the message boundary visible */
                continue;
            }
            printf("%c", data[i]);
        }
        printf("\n");
    }
} while (rc > 0);
and on my PC it shows:
Received data: This is the first test string\x00This is the second test string\x00
Why can’t the child read both strings even if they are both written before it can read one?
Well, the problem is not in reading; it's in how you are displaying the data you read. (Still, the reading could be improved - one should handle that pesky EAGAIN errno code.)
How can I solve this problem?
If you want a 1:1 relationship between reads and writes, use constant-size packets, or more generally you have to know in advance how many bytes you want to read. Bytes written are "concatenated" together and lose their structure. Or use pipe(3p), on which writes smaller than PIPE_BUF are guaranteed to be atomic. Or you could use a POSIX message queue, mq_receive/mq_send, which preserves message boundaries.
Or write a proper "deserializer": something that buffers data, keeps internal state, and notifies the higher level only when a whole "message" has been received, i.e. detects when a zero byte arrives in the stream of bytes and restores structure to that stream.
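A minimal sketch of such a deserializer for the NUL-delimited messages used in the question - it buffers whatever read() returns and delivers a message only once its terminating zero byte has arrived (messages are assumed to fit in the buffer):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Deliver one callback per complete, NUL-terminated message. */
static void on_message(const char *msg) {
    printf("Received data: %s\n", msg);
}

static void read_messages(int fd) {
    char buf[400];
    size_t used = 0;   /* bytes buffered but not yet delivered */
    ssize_t rc;
    while ((rc = read(fd, buf + used, sizeof(buf) - used)) > 0) {
        used += rc;
        char *start = buf;
        char *nul;
        /* deliver every complete message currently in the buffer */
        while ((nul = memchr(start, '\0', used - (start - buf))) != NULL) {
            on_message(start);
            start = nul + 1;
        }
        /* keep the incomplete tail (if any) for the next read() */
        used -= start - buf;
        memmove(buf, start, used);
    }
}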
This is my first post on Stack Overflow and my native language is not English; please excuse any inconvenience this post brings you. It may be a little long, so I thank you in advance for your patience!
I have a C language code snippet whose job is counting the number of words in two files. I use pthreads to solve this problem, but I find that the order of these two statements
count_words(argv[1]);
pthread_create(&t1, NULL, count_words, (void *)argv[2]);
affects the program's performance, which is the opposite of what I expected. Here is the code:
#include <stdio.h>
#include <pthread.h>
#include <ctype.h>
#include <stdlib.h>

int total_words;

int main(int argc, char *argv[]) {
    pthread_t t1;
    void *count_words(void *);
    if (argc != 3) {
        printf("usage: %s file1 file2\n", argv[0]);
        exit(1);
    }
    total_words = 0;
    count_words(argv[1]); // program runs faster when executing this first
    pthread_create(&t1, NULL, count_words, (void *)argv[2]);
    pthread_join(t1, NULL);
    printf("%5d: total words\n", total_words);
    return 0;
}

void *count_words(void *f) {
    char *filename = (char *)f;
    FILE *fp;
    int c, prevc = '\0';
    if ((fp = fopen(filename, "r")) == NULL) {
        perror(filename);
        exit(1);
    }
    while ((c = getc(fp)) != EOF) {
        if (!isalnum(c) && isalnum(prevc))
            total_words++;
        prevc = c;
    }
    fclose(fp);
    return NULL;
}
Performance:
I run the program using "time program_name" on the command line to test the running speed. The output is:
If the order is like this:
count_words(argv[1]);
pthread_create(&t1, NULL, count_words, (void *)argv[2]);
program runs fast: real 0.014s
If it is like this:
pthread_create(&t1, NULL, count_words, (void *)argv[2]);
count_words(argv[1]);
program runs slow: real 0.026s
What I expected:
In case 1, the program runs count_words() first. Only after completing the counting job does it continue to pthread_create(), at which point the new thread does its counting. So the new thread does its job after the original thread completes, which is sequential rather than parallel execution. In case 2, the program calls pthread_create() before any counting, so afterwards two threads count in parallel. I therefore expected case 2 to be faster than case 1, but I was wrong: case 2 is slower. Could anybody give me some useful info on this?
Note
Please ignore the fact that I don't put a mutex lock on the global variable total_words; this is not the part I am concerned about. The program is just for testing - please forgive its imperfections.
Edit 1
Below is the supplement and improvement after I read some suggestions.
a) Supplement: The processor is an Intel® Celeron(R) CPU 420 @ 1.60GHz. One core.
b) Improvement: I have improved my example with two changes:
1) I enlarged the files: file1 is 2080651 bytes (about 2 MB), and file2 is a copy of file1.
2) I modified count_words(): on reaching the end of the file, it uses fseek() to reset fp to the beginning and counts again, repeating COUNT times, with COUNT defined as 20. Below is the changed code:
#define COUNT 20
// other unchanged code ...
void *count_words(void *f) {
    // other unchanged code ...
    int i;
    for (i = 0; i < COUNT; i++) {
        while ((c = getc(fp)) != EOF) {
            if (!isalnum(c) && isalnum(prevc))
                total_words++;
            prevc = c;
        }
        fseek(fp, 0, SEEK_SET);
    }
    fclose(fp);
    return NULL;
}
Output of fast_version (count_words() first) and slow_version (pthread_create() first):
administrator@ubuntu:~$ time ./fast_version file1 file2
12241560: total words
real 0m5.057s
user 0m4.960s
sys 0m0.048s
administrator@ubuntu:~$ time ./slow_version file1 file2
12241560: total words
real 0m7.636s
user 0m7.280s
sys 0m0.048s
I tried the "time progname file1 file2" command a few times. There may be a difference of a tenth or a hundredth of a second between runs, but the differences are not large.
Edit 2
This part was added after I did some experiments prompted by a hint --
When you launch the second thread after the first thread completes its execution, there is no context switching overhead.
--by user315052.
The experiment: I modified count_words() again:
void *count_words(void *f) {
    // unchanged code ...
    for (i = 0; i < COUNT; i++) {
        while ((c = getc(fp)) != EOF) {
            if (!isalnum(c) && isalnum(prevc))
                total_words++;
            prevc = c;
        }
        fseek(fp, 0, SEEK_SET);
        printf("from %s\n", filename); // This statement is newly added.
    }
    // unchanged code ...
}
I added the statement printf("from %s\n", filename); so I can tell which file (or thread) is being processed at any moment. The output of the fast version is 20 lines of "from file1" followed by 20 lines of "from file2"; in the slow version, "from file1" and "from file2" come out interleaved.
It looks like the fast version is faster because there is no context switching. But in fact, after count_words() finished, the original thread was not dead: it created a new thread and waited for it to terminate. Is there really no context switching while the new thread is running? I watched the screen closely and found that "from file2" prints noticeably more slowly than "from file1". Why? Is it because context switching happens while file2 is being counted?
For the slow version, we can see from the output that "from file1" and "from file2" print even more slowly than "from file2" does in the fast version, because the context switching costs more time during parallel counting, whereas in the fast version the switching is not as heavy: one of the threads has finished its job and is merely waiting.
So I think the main reason is that the fast version involves lighter context switching than the slow version. But "printing speed" is just my observation and may not be rigorous, so I am not sure about it.
In a comment, you wrote:
The processor is Intel® Celeron(R) CPU 420 @ 1.60GHz. One core.
Since you only have one core, your thread executions are serialized anyway. Your program with two threads running concurrently pays the overhead of thread context switching as each performs blocking I/O.
When you launch the second thread after the first thread completes its execution, there is no context switching overhead.
Try doing the same measurement, but run your program 100 times and compute the average time; with runs this short, the effect of caching, for example, is far from negligible.
How did you measure?
Real time is not an indication of how long your program ran; you have to measure user+system time. What's more, meaningful timing at the millisecond level depends very much on the granularity of your timing clock: if it ticks at, say, 60 Hz, you have a problem.
Coming up with meaningful benchmarks is an art...
As a start, you should find a way to run your threads in a loop, say, 10,000 times, and add up the numbers. That will at least get you out of the millisecond-timing problem.
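For what it's worth, a sketch of measuring user+system time from inside the program with getrusage(), instead of relying on the shell's time:

#include <stdio.h>
#include <sys/resource.h>

/* Returns user + system CPU seconds consumed by this process so far. */
static double cpu_seconds(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

/* usage: double t0 = cpu_seconds(); ... work ...
 * printf("cpu: %.3fs\n", cpu_seconds() - t0); */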
I am writing a program that should search through a directory the user supplies, in order to find all files in that directory that were accessed, modified, or changed within a day of a given time. I am having two definite problems and perhaps a third.
The first problem is that I can only get the program to do shallow searches; it won't look through any subdirectories. I am sure it has to do with what I concatenate to the directory buffer (right now it is "*.*"). The second problem is that it is not searching every file, though it does look at most of them - I think this goes back to problem one. The third "problem" is that when I check the access time of each file, they all seem to be the same (though I don't have this problem with the changed/modified times). I am running Ubuntu in a VM, if that might be affecting the access times.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <glob.h>
/* Function that checks if the specified file was accessed, modified, or
   changed within a day of the specified time. */
void checkFile(long long time, char * fileName) {
    struct stat *s;
    time_t accessTime;
    time_t modTime;
    time_t changeTime;
    s = malloc(sizeof(struct stat));
    if (stat(fileName, s) == 0) {
        accessTime = s->st_atime;
        modTime = s->st_mtime;
        changeTime = s->st_ctime;
        if ((time - accessTime) <= 86400 && (time - accessTime) >= -86400)
            printf("%s\n", fileName);
        else if ((time - modTime) <= 86400 && (time - modTime) >= -86400)
            printf("%s\n", fileName);
        else if ((time - changeTime) <= 86400 && (time - changeTime) >= -86400)
            printf("%s\n", fileName);
    }
    free(s);
}

void searchDirectory(long long time, glob_t globbuf) {
    if (globbuf.gl_pathc == 0)
        printf("there were no matching files");
    else {
        int i;
        for (i = 0; i < globbuf.gl_pathc; i++)
            checkFile(time, globbuf.gl_pathv[i]);
    }
}

int main(int argc, char** argv) {
    long long time = atoll(argv[1]); /* atoll, since the target is a long long */
    char * buf = argv[2];
    strcat(buf, "*.*");
    glob_t globbuf;
    glob(buf, 0, NULL, &globbuf);
    searchDirectory(time, globbuf);
    globfree(&globbuf);
    return 0;
}
Thanks for your time!
You should not do
strcat(buf, "*.*");
...since buf is a pointer to a string provided by the OS - you don't know whether that buffer is large enough to hold the extra text you are appending. You could allocate a large buffer, copy the contents of argv[2] into it, and then append "*.*", but to be really safe you should determine the length of argv[2] and make sure your buffer is large enough.
You can use the st_mode member of struct stat to determine whether the file is a directory (test it with the S_ISDIR() macro, or compare (st_mode & S_IFMT) against S_IFDIR). If it is, you could make it the current directory and, as jonsca suggested, call your searchDirectory function again. But when using recursion you usually want to limit the recursion depth, or you can overflow the stack. That is a kind of "depth-first search". The solution I prefer is a "breadth-first search" using a queue: push your first glob onto the start of a list, then repeatedly take the first item off that list and search it, adding new directories to the end of the list as you go, until the list is empty - see the sketch below.
When evaluating programs like this, teachers love to award extra points for those that don't blow their stack too easily :)
P.S. I'm guessing the access time issue comes from the filesystem mount options (Linux typically mounts with relatime, which suppresses most atime updates) rather than anything you did.
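To make the queue idea concrete, here is a minimal sketch of the breadth-first search (searchTree is a made-up name; it reuses checkFile() from the question and walks directories with opendir()/readdir() instead of glob()):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <dirent.h>

void checkFile(long long time, char * fileName); /* from the question */

struct qnode { char *path; struct qnode *next; };

/* Breadth-first walk: pop a directory off the queue, scan it with
 * readdir(), enqueue any subdirectories, and check regular files. */
void searchTree(long long time, const char *root) {
    struct qnode *head = malloc(sizeof *head), *tail = head;
    head->path = strdup(root);
    head->next = NULL;
    while (head != NULL) {
        DIR *d = opendir(head->path);
        if (d != NULL) {
            struct dirent *e;
            while ((e = readdir(d)) != NULL) {
                if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
                    continue;
                char full[4096];
                snprintf(full, sizeof full, "%s/%s", head->path, e->d_name);
                struct stat st;
                if (stat(full, &st) != 0)
                    continue;
                if (S_ISDIR(st.st_mode)) {       /* search it later */
                    struct qnode *n = malloc(sizeof *n);
                    n->path = strdup(full);
                    n->next = NULL;
                    tail->next = n;
                    tail = n;
                } else {
                    checkFile(time, full);
                }
            }
            closedir(d);
        }
        struct qnode *done = head;               /* pop the front */
        head = head->next;
        free(done->path);
        free(done);
    }
}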
I'm working on a class project in which I must write a command line shell with the following requirements:
The shell must be able to read buffered input
Buffer should be 64 characters
Error conditions should be handled
Exceeded buffer size
Interruptions (when a signal arrives) – see the man page for read()
Invalid input (unparsable characters, blank lines, etc)
Any other error that may be encountered.
Shell must have a history of at least 20 items, and the history must not be of a static size. When the history buffer is full, the oldest item is removed and the newest item added.
Programs should be able to run in the foreground or background. (using &)
Ctrl-D will exit the shell
Ctrl-C will print the complete history
The Command ‘history’ will also print the complete history. Newest items will be at the bottom of the list.
All other signals will be trapped and displayed to the user in the shell
Program will use the read() command to read in input, unless the arrow keys are supported
I have opted to implement arrow keys for history cycling, so I'm using ncurses for input rather than read(). I think I'm doing all right using strtok() to parse input and fork() and execvp() to run the processes, but I'm not doing so well with ncurses. All I've gotten it to do so far is init a new screen, display the prompt, then segfault on any key press. Not good.
I reckon the problem must be in my design. I'm not wrapping my head around ncurses too well. What sort of data structures should I be using for this project? How should I handle the ncurses setup, teardown, and everything in between? What's the deal with windows and screens, and should I have a single globally accessible window/screen that I work with? Also, I've been trying to use a char* for the input buffer, and a char** for the command history, but I have no experience in C, so despite reading up on malloc, calloc, and realloc, I'm not sure of the best way to store commands in the buffer and the history. Any tips on managing these char arrays?
tl;dr: How do I use ncurses correctly to make a command line shell, and how do I handle the command memory management with C?
I realize this is a pretty hefty question. :(
edit: I have already seen http://www.gnu.org/software/libc/manual/html_node/Implementing-a-Shell.html and http://www.linuxinfor.com/english/NCURSES-Programming/ but the ncurses documentation actually has too much overhead. I just want to use its ability to recognize arrow keys.
Here's some sample code which:
Performs dynamic memory allocation.
Reads from the console in non-blocking mode.
Uses VT100 codes to print a frame buffer to the console.
It compiles on Linux using GCC without warnings or errors. It's far from bug free, but it should give you some ideas of what's possible. Compile and run it: pressing [up] and [down] will print messages, and typing characters and hitting [enter] will "execute" the command.
#define _GNU_SOURCE /* for ppoll() */
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
/** VT100 command to clear the screen. Use puts(VT100_CLEAR_SCREEN) to clear
* the screen. */
#define VT100_CLEAR_SCREEN "\033[2J"
/** VT100 command to reset the cursor to the top left hand corner of the
* screen. */
#define VT100_CURSOR_TO_ORIGIN "\033[H"
struct frame_s
{
int x;
int y;
char *data;
};
static int draw_frame(struct frame_s *frame)
{
int row;
char *data;
int attrib;
puts(VT100_CLEAR_SCREEN);
puts(VT100_CURSOR_TO_ORIGIN);
for (row = 0, data = frame->data; row < frame->y; row++, data += frame->x)
{
/* 0 for normal, 1 for bold, 7 for reverse. */
attrib = 0;
/* The VT100 commands to move the cursor, set the attribute, and the
* actual frame line. */
fprintf(stdout, "\033[%d;%dH\033[0m\033[%dm%.*s", row + 1, 0, attrib, frame->x, data);
fflush(stdout);
}
return (0);
}
int main(void)
{
const struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };
struct frame_s frame;
struct termios tty_old;
struct termios tty_new;
unsigned char line[128];
unsigned int count = 0;
int ret;
struct pollfd fds[1];
sigset_t sigmask;
struct tm *tp;
time_t current_time;
/* Set up a little frame. */
frame.x = 80;
frame.y = 5;
frame.data = malloc(frame.x * frame.y);
if (frame.data == NULL)
{
fprintf(stderr, "No memory\n");
exit (1);
}
memset(frame.data, ' ', frame.x * frame.y);
/* Get the terminal state. */
tcgetattr(STDIN_FILENO, &tty_old);
tty_new = tty_old;
/* Turn off "cooked" mode (line buffering) and set minimum characters
* to zero (i.e. non-blocking). */
tty_new.c_lflag &= ~ICANON;
tty_new.c_cc[VMIN] = 0;
/* Set the terminal attributes. */
tcsetattr(STDIN_FILENO, TCSANOW, &tty_new);
/* Un-mask all signals while in ppoll() so any signal will cause
* ppoll() to return prematurely. */
sigemptyset(&sigmask);
fds[0].events = POLLIN;
fds[0].fd = STDIN_FILENO;
/* Loop forever waiting for key presses. Update the output on every key
* press and every 1.0s (when ppoll() times out). */
do
{
fds[0].revents = 0;
ret = ppoll(fds, sizeof(fds) / sizeof(struct pollfd), &timeout, &sigmask);
if (fds[0].revents & POLLIN)
{
ret = read(STDIN_FILENO, &line[count], sizeof(line) - count - 1); /* leave room for the '\0' below */
if (ret > 0)
{
line[count + ret] = '\0';
if (strcmp(&line[count], "\033[A") == 0)
{
snprintf(frame.data, frame.x, "up");
count = 0;
}
else if (strcmp(&line[count], "\033[B") == 0)
{
snprintf(frame.data, frame.x, "down");
count = 0;
}
else if (line[count] == 127) // backspace
{
if (count != 0) { count -= ret;}
}
else if (line[count] == '\n')
{
snprintf(frame.data, frame.x, "entered: %s", line);
count = 0;
}
else
{
count += ret;
}
}
}
/* Print the current time to the output buffer. */
current_time = time(NULL);
tp = localtime(&current_time);
strftime(&frame.data[1 * frame.x], frame.x, "%Y/%m/%d %H:%M:%S", tp);
/* Print the command line. */
line[count] = '\0';
snprintf(&frame.data[(frame.y - 1) * frame.x], frame.x, "$ %s", line);
draw_frame(&frame);
}
while (1);
/* Restore terminal and free resources. */
tcsetattr(STDIN_FILENO, TCSANOW, &tty_old);
free(frame.data);
return (0);
}
If your input buffer is defined to be 64 characters, then I would recommend using a char array instead of a char*. Something like char input_buffer[65]; should serve your purposes (add an extra character for the trailing '\0').
As far as command history goes, you can use a two-dimensional array for that. Something like char command_history[20][65]; should let you store 20 old commands of 64 characters each.
Allocating these buffers statically should make things a bit easier for you, as you won't have to worry about malloc and friends.
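For instance, a sketch of the "oldest item drops off" behaviour with those static buffers (history_add is a made-up helper name):

#include <string.h>

#define HIST_MAX 20
#define CMD_LEN 65   /* 64 characters + '\0' */

char command_history[HIST_MAX][CMD_LEN];
int hist_count = 0;

/* Append a command; when full, shift everything up one slot so the
 * oldest entry is discarded and the newest lands at the bottom. */
void history_add(const char *cmd) {
    if (hist_count == HIST_MAX) {
        memmove(command_history[0], command_history[1],
                (HIST_MAX - 1) * CMD_LEN);
        hist_count--;
    }
    strncpy(command_history[hist_count], cmd, CMD_LEN - 1);
    command_history[hist_count][CMD_LEN - 1] = '\0';
    hist_count++;
}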
It's hard to give you too much specific advice without seeing your code. I have a feeling that you are making the same type of mistakes that are typical to people first learning C. Can you post the part of your code that is giving you problems so that we can learn more about what you are doing?
Update after posted provided code:
One problem I'm seeing is that the function takeInput doesn't have a return statement. When you use input = takeInput(); inside your main function, the value of input isn't being set to what you think it is. It's probably not a valid pointer, which is causing your line that says input[j] to segfault.
Your usage of cmdHistory also needs revisiting. You allocate it with cmdHistory = (char**)calloc(21,sizeof(int));, which gives you enough space to store 21 integers. In the function printHistory, you pass elements of cmdHistory to printw as if they were strings (they're only integers). This is most definitely not doing what you want it to do. Instead, your allocation logic for cmdHistory needs to look more like your de-allocation logic (except backwards). Allocate an array of char**, then iterate through the array, assigning each pointer to a newly-allocated buffer. Just like you have one free statement for each element in the array plus a free for the array as a whole, you should have one malloc for each element plus one malloc for the array as a whole.
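A sketch of that pattern (alloc_history/free_history are made-up names) - the allocation mirrors the de-allocation, just backwards:

#include <stdlib.h>

/* One malloc for the array of pointers, then one malloc per element. */
char **alloc_history(int slots, int width) {
    char **h = malloc(slots * sizeof(char *));
    for (int i = 0; i < slots; i++)
        h[i] = malloc(width);
    return h;
}

void free_history(char **h, int slots) {
    for (int i = 0; i < slots; i++)
        free(h[i]);   /* one free per element... */
    free(h);          /* ...plus one for the array itself */
}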
Even if you can't use a statically-allocated stack, try writing your program using one anyway. This will let you work the kinks out of your key detection logic, etc without having to worry about the dynamic memory part of the program. Once the rest of it is working, go back in and swap out the static memory for dynamic memory allocation. That way, you're only having to debug a little bit at a time.
Have you looked at the Readline library? It's ideal for use in your project.
http://cnswww.cns.cwru.edu/php/chet/readline/rltop.html
Hmm, I wonder whether there is a way to read a file faster than using fscanf().
For example, suppose that I have this text:
4
55 k
52 o
24 l
523 i
First I want to read the first number, which gives the number of following lines; let this number be called N. After N, I want to read N lines, each of which has an integer and a character. With fscanf it would be like this:
fscanf(fin,"%d %c",&a,&c);
You do almost no processing, so the bottleneck is probably the file system throughput. However, you should first measure whether it really is. If you don't want to use a profiler, you can simply measure the running time of your application: the size of the input file divided by the running time can be compared against your file system's throughput limit.
Then, if you are far from the aforementioned limit, you probably need to optimize the way you read the file. It may be better to read it in larger chunks using fread() and then process the buffer in memory with sscanf().
You can also parse the buffer yourself, which would be faster than *scanf().
[edit]
Especially for Drakosha:
$ time ./main1
Good entries: 10000000
real 0m3.732s
user 0m3.531s
sys 0m0.109s
$ time ./main2
Good entries: 10000000
real 0m0.605s
user 0m0.496s
sys 0m0.094s
So the optimized version achieves ~127 MB/s, which may be my file system's throughput limit - or maybe the OS caches the file in RAM. The original version manages ~20 MB/s.
Tested with an 80 MB file:
10000000
1234 a
1234 a
...
main1.c
#include <stdio.h>

int ok = 0;

void processEntry(int a, char c) {
    if (a == 1234 && c == 'a') {
        ++ok;
    }
}

int main(int argc, char **argv) {
    FILE *f = fopen("data.txt", "r");
    int total = 0;
    int a;
    char c;
    int i = 0;
    fscanf(f, "%d", &total);
    for (i = 0; i < total; ++i) {
        if (2 != fscanf(f, "%d %c", &a, &c)) {
            fclose(f);
            return 1;
        }
        processEntry(a, c);
    }
    fclose(f);
    printf("Good entries: %d\n", ok);
    return (ok == total) ? 0 : 1;
}
main2.c
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h> /* for isdigit()/isalpha() */

int ok = 0;

void processEntry(int a, char c) {
    if (a == 1234 && c == 'a') {
        ++ok;
    }
}

int main(int argc, char **argv) {
    FILE *f = fopen("data.txt", "r");
    int total = 0;
    int a;
    char c;
    int i = 0;
    char *numberPtr = NULL;
    char buf[2048];
    size_t toProcess = sizeof(buf);
    int state = 0;
    int fileLength, lengthLeft;
    fseek(f, 0, SEEK_END);
    fileLength = ftell(f);
    fseek(f, 0, SEEK_SET);
    fscanf(f, "%d", &total); // read the first line
    lengthLeft = fileLength - ftell(f);
    // read the other lines using an FSM
    do {
        if (lengthLeft < sizeof(buf)) {
            fread(buf, lengthLeft, 1, f);
            toProcess = lengthLeft;
        } else {
            fread(buf, sizeof(buf), 1, f);
            toProcess = sizeof(buf);
        }
        lengthLeft -= toProcess;
        for (i = 0; i < toProcess; ++i) {
            switch (state) {
            case 0: // waiting for the number to start
                if (isdigit(buf[i])) {
                    state = 1;
                    a = buf[i] - '0';
                }
                break;
            case 1: // accumulating digits
                if (isdigit(buf[i])) {
                    a = a * 10 + buf[i] - '0';
                } else {
                    state = 2;
                }
                break;
            case 2: // waiting for the character
                if (isalpha(buf[i])) {
                    state = 0;
                    c = buf[i];
                    processEntry(a, c);
                }
                break;
            }
        }
    } while (toProcess == sizeof(buf));
    fclose(f);
    printf("Good entries: %d\n", ok);
    return (ok == total) ? 0 : 1;
}
It is unlikely you can significantly speed up the actual reading of the data. Most of the time here will be spent transferring the data from disk to memory, which is unavoidable.
You might get a little speed-up by replacing the fscanf call with fgets and then manually parsing the string (with strtol) to bypass the format-string parsing that fscanf has to do, but don't expect any huge savings.
In the end, it is usually not worth it to heavily optimise I/O operations, because they will typically be dominated by the time it takes to transfer the actual data to/from the hardware/peripherals.
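For what it's worth, a sketch of that fgets()/strtol() approach for the input format in the question (data.txt is a placeholder filename):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *fin = fopen("data.txt", "r");   /* placeholder filename */
    if (!fin) return 1;
    char line[64];
    if (!fgets(line, sizeof line, fin)) return 1;
    long n = strtol(line, NULL, 10);      /* number of entries */
    for (long i = 0; i < n && fgets(line, sizeof line, fin); i++) {
        char *end;
        long a = strtol(line, &end, 10);  /* the integer */
        while (*end == ' ') end++;        /* skip separating blanks */
        char c = *end;                    /* the character */
        /* ... process a and c ... */
        (void)a; (void)c;
    }
    fclose(fin);
    return 0;
}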
As usual, start with profiling to make sure this part is indeed the bottleneck. Actually, the file system cache should make the small reads you are doing fairly cheap; however, reading larger parts of the file into memory and then operating on that memory might be (a little) faster.
In the (I believe extremely improbable) case that you need to save every CPU cycle, you could write your own fscanf variant, since you know the format of the string and only need to support that one case. But this improvement would bring small gains too, especially on modern CPUs.
The input looks like that of various programming contests. In that case: optimize the algorithm, not the reading.
fgets() or fgetc() are faster, as they don't need to drag fscanf()'s whole formatting/variable-argument-list ballet into the program. Either of those two functions will leave you with a manual characters-to-integer conversion, however. Still, the program as a whole will be much faster.
There is not much hope of reading the file faster, since that comes down to system calls, but there are many ways to parse it faster than scanf, with specialized code.
Check out read and fread. Since you are practicing for programming contests, you can ignore warnings about the disk I/O bottleneck, because the files may be in memory or piped from another process generating the tests on-the-fly.
Put your tests into /dev/shm (a tmpfs mount), or write a test generator and pipe its output in.
I've found in programming contests that parsing numbers in the manner of atoi can give a big performance boost over scanf/fscanf (atoi itself might not be usable directly, so be prepared to implement the idea by hand - it's easy); see the sketch below.
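For completeness, a sketch of that hand-rolled atoi-style conversion (parse_uint is a made-up helper; non-negative integers only):

/* Parse a non-negative decimal integer starting at *p and advance
 * the pointer past the digits -- the usual contest-style helper. */
static int parse_uint(const char **p) {
    int v = 0;
    while (**p >= '0' && **p <= '9')
        v = v * 10 + (*(*p)++ - '0');
    return v;
}

/* usage: const char *s = "523 i"; int a = parse_uint(&s); -- s now points at " i" */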