Consider the following code, which loads a dataset of records into a buffer and creates a Record object for each record. A record consists of one or more columns, and the number of columns is discovered at run time. In this particular example, however, I have set the number of columns to 3.
typedef unsigned int uint;
typedef struct
{
uint *data;
} Record;
Record *createNewRecord (short num_cols);
int main(int argc, char *argv[])
{
time_t start_time, end_time;
int num_cols = 3;
char *relation;
FILE *stream;
int offset;
char *filename = "file.txt";
stream = fopen(filename, "r");
fseek(stream, 0, SEEK_END);
long fsize = ftell(stream);
fseek(stream, 0, SEEK_SET);
if(!(relation = (char*) malloc(sizeof(char) * (fsize + 1))))
printf("Could not allocate buffer");
fread(relation, sizeof(char), fsize, stream);
relation[fsize] = '\0';
fclose(stream);
char *start_ptr = relation;
char *end_ptr = (relation + fsize);
while (start_ptr < end_ptr)
{
Record *new_record = createNewRecord(num_cols);
for(short i = 0; i < num_cols; i++)
{
sscanf(start_ptr, " %u %n",
&(new_record->data[i]), &offset);
start_ptr += offset;
}
}
return 0;
}
Record *createNewRecord (short num_cols)
{
Record *r;
if(!(r = (Record *) malloc(sizeof(Record))) ||
!(r->data = (uint *) malloc(sizeof(uint) * num_cols)))
{
printf(("Failed to create new a record\n");
}
return r;
}
This code is highly inefficient. My dataset contains around 31 million records (~1 GB) and this code processes only ~200 records per minute. The reason I load the dataset into a buffer is that I'll later have multiple threads process the records in this buffer, and I want to avoid file accesses. Moreover, I have 48 GB of RAM, so keeping the dataset in memory should not be a problem. Any ideas on how to speed things up?
SOLUTION: the sscanf function was actually extremely slow and inefficient. When I switched to strtoul, the job finished in less than a minute. Malloc-ing ~31 million structs of type Record took only a few seconds.
I'm confident that non-numeric data is lurking in the file.
int offset;
...
sscanf(start_ptr, " %u %n", &(new_record->data[i]), &offset);
start_ptr += offset;
Notice that if the file begins with non-numeric input, offset is never set; and even if it happened to start out as 0, start_ptr += offset; would never advance.
If non-numeric data such as "3x" occurs later in the file, offset will be set to 1 and never updated afterwards, causing the while loop to crawl forward one byte per iteration.
Best to check results of fread(), ftell() and sscanf() for unexpected return values and act accordingly.
Further: long fsize may be too small a type to hold the file size. Look into using fgetpos() and fsetpos().
Note: to save processing time, consider using strtoul() as it is certainly faster than sscanf(" %u %n"). Again - check for errant results.
BTW: if the code needs to use sscanf(), use sscanf("%u%n"); it is a tad faster and, for your code, provides the same functionality.
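To make the strtoul() suggestion concrete, here is a minimal sketch; parse_row() is a hypothetical helper of mine, not part of the original code:
#include <stdlib.h>

/* Parse num_cols unsigned integers starting at *cursor, advancing *cursor
 * past what was consumed. Returns the number of values actually parsed. */
static int parse_row(char **cursor, unsigned int *out, int num_cols)
{
    for (int i = 0; i < num_cols; i++) {
        char *next;
        unsigned long v = strtoul(*cursor, &next, 10); /* skips leading whitespace */
        if (next == *cursor)
            return i;               /* no digits consumed: errant data, stop */
        out[i] = (unsigned int)v;
        *cursor = next;
    }
    return num_cols;
}
In main(), the inner for loop would become a call like parse_row(&start_ptr, new_record->data, num_cols), with a return value below num_cols signalling errant data instead of a silent infinite loop.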
I'm not an optimization professional but I think some tips should help.
First of all, I suggest you define filename and num_cols as macros; literals tend to be faster, and I don't see you changing their values anywhere in the code.
Second, using a struct to hold a single member is generally not recommended, but if you do want to pass it to functions, you should pass pointers only. Since I see you're calling malloc once to store the struct and again to store its only member, I suspect that is part of why this is slow: you're doing twice the allocation work you need. (This might not be the case with some compilers.) Practically speaking, a struct with only one member is pointless; if you want to mark that the integer you get is specifically a record, you can typedef it.
You should also make end_ptr and fsize const, which may allow some optimization.
Now, as for functionality, have a look at memory-mapped I/O.
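For example, a minimal POSIX sketch that maps the file in place of the malloc+fread pair (using the question's file name; error handling abbreviated):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("file.txt", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0)
        return 1;
    char *relation = mmap(NULL, (size_t)st.st_size, PROT_READ,
                          MAP_PRIVATE, fd, 0);
    if (relation == MAP_FAILED)
        return 1;
    /* ... parse relation[0 .. st.st_size) here. Note: unlike the malloc'd
     * buffer, the mapping has no terminating '\0', so parse by length
     * rather than relying on string functions. */
    munmap(relation, (size_t)st.st_size);
    close(fd);
    return 0;
}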
Related
I have a piece of code that analyzes streams of data from very large (10-100GB) binary files. It works well, so it's time to start optimizing, and currently disk IO is the biggest bottleneck.
There are two types of files in use. The first type consists of a stream of 16-bit integers, which must be scaled after I/O to convert to a physically meaningful floating-point value. I read the file in chunks, and within each chunk I read one 16-bit code at a time, perform the required scaling, and store the result in an array. Code is below:
int64_t read_current_chimera(FILE *input, double *current,
int64_t position, int64_t length, chimera *daqsetup)
{
int64_t test;
uint16_t iv;
int64_t i;
int64_t read = 0;
if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
{
return 0;
}
for (i = 0; i < length; i++)
{
test = fread(&iv, sizeof(uint16_t), 1, input);
if (test == 1)
{
read++;
current[i] = chimera_gain(iv, daqsetup);
}
else
{
perror("End of file reached");
break;
}
}
return read;
}
The chimera_gain function just takes a 16-bit integer, scales it and returns the double for storage.
The second file type contains 64-bit doubles, but it contains two columns, of which I only need the first. To do this I fread pairs of doubles and discard the second one. The double must also be endian-swapped before use. The code I use to do this is below:
int64_t read_current_double(FILE *input, double *current, int64_t position, int64_t length)
{
int64_t test;
double iv[2];
int64_t i;
int64_t read = 0;
if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
{
return 0;
}
for (i = 0; i < length; i++)
{
test = fread(iv, sizeof(double), 2, input);
if (test == 2)
{
read++;
swapByteOrder((int64_t *)&iv[0]);
current[i] = iv[0];
}
else
{
perror("End of file reached: ");
break;
}
}
return read;
}
Can anyone suggest a method of reading these file types that would be significantly faster than what I am currently doing?
First off, it would be useful to use a profiler to identify the hot spots in your program. Based on your description of the problem, the sheer number of fread calls is creating a lot of overhead. Since the files are large, there is a big benefit to be had by increasing the amount of data you read per I/O.
Convince yourself of this by putting together 2 small programs that read the stream.
1) Read it as you do in the example above, 2 doubles at a time.
2) Read it the same way, but 10,000 doubles at a time.
Time both a few times, and odds are you will observe that #2 runs much faster.
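For illustration, here is a sketch of variant #2 applied to read_current_double from the question; CHUNK, the static buffer, and swap_double (standing in for the question's swapByteOrder) are my own assumptions:
#define _LARGEFILE64_SOURCE
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the question's swapByteOrder(): reverse 8 bytes in place. */
static void swap_double(double *d)
{
    unsigned char *p = (unsigned char *)d;
    for (int a = 0, b = 7; a < b; a++, b--) {
        unsigned char t = p[a]; p[a] = p[b]; p[b] = t;
    }
}

#define CHUNK 10000   /* pairs per fread; tune experimentally */

int64_t read_current_double_chunked(FILE *input, double *current,
                                    int64_t position, int64_t length)
{
    static double iv[2 * CHUNK];   /* static: too large for the stack */
    int64_t done = 0;
    if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
        return 0;
    while (done < length) {
        size_t want = (size_t)((length - done > CHUNK) ? CHUNK : length - done);
        size_t got = fread(iv, 2 * sizeof(double), want, input);
        for (size_t k = 0; k < got; k++) {
            swap_double(&iv[2 * k]);             /* first column only */
            current[done + (int64_t)k] = iv[2 * k];
        }
        done += (int64_t)got;
        if (got < want)
            break;                               /* short read: EOF or error */
    }
    return done;
}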
Best of luck.
Basically I'm building a bit reader. Slightly before the buffer is exhausted, I'd like to copy whatever is left of the array to the beginning of the same array, zero everything after the copy, and then fill the rest with data from the input file.
I'm only trying to use the standard library for portability reasons.
Also, I was profiling my bit reader earlier, and Instruments says it takes about 28 milliseconds to do all of this. Is it supposed to take that long?
Code removed
I recommend using memmove for the copy. It has a signature and functionality identical to memcpy, except that it is safe for copying between overlapping regions (which is what you're describing).
For the zero-fill, memset is usually adequate. On implementations where, for example, null pointers aren't represented by an all-zero bit pattern, you'll need to roll your own loop using assignment, with the details depending on the type.
For this reason you might want to hide the memmove and memset operations behind abstraction, for example:
#include <string.h>
void copy_int(int *destination, int *source, size_t size) {
memmove(destination, source, size * sizeof *source);
}
void zero_int(int *seq, size_t size) {
memset(seq, 0, size * sizeof *seq);
}
int main(void) {
int array[] = { 0, 1, 2, 3, 4, 5 };
size_t index = 2
, size = sizeof array / sizeof *array - index;
copy_int(array, array + index, size);
zero_int(array + size, index);
}
Should either memmove or memset become unsuitable for your use cases in the future, it'll then be simple to drop in your own copy/zero loops.
As for your strange profiler results, I suppose it might be possible that you're using some archaic (or grossly underclocked) implementation, or trying to copy huge arrays... Otherwise, 28 milliseconds does seem quite absurd. Nonetheless, your profiler would surely have identified that this memmove and memset isn't a significant bottleneck in a program that performs actual I/O work, right? The I/O must surely be the bottleneck, right?
If the memmove+memset is indeed a bottleneck, you could try implementing a circular array to avoid the copies. For example, the following code attempts to find needle in the figurative haystack that is input_file...
Otherwise, if the I/O is a bottleneck, there are tweaks that can be applied to reduce that. For example, the following code uses setvbuf to suggest that the underlying implementation attempt to use an underlying buffer to read chunks of the file, despite the code using fgetc to read one character at a time.
void find_match(FILE *input_file, char const *needle, size_t needle_size) {
/* setvbuf must come before any other operation on the stream */
setvbuf(input_file, NULL, _IOFBF, BUFSIZ);
char input_array[needle_size];
size_t sz = fread(input_array, 1, needle_size, input_file);
if (sz != needle_size) {
// No matches possible
return;
}
unsigned long long pos = 0;
for (;;) {
size_t cursor = pos % needle_size;
int tail_compare = memcmp(input_array, needle + needle_size - cursor, cursor),
head_compare = memcmp(input_array + cursor, needle, needle_size - cursor);
if (head_compare == 0 && tail_compare == 0) {
printf("Match found at offset %llu\n", pos);
}
int c = fgetc(input_file);
if (c == EOF) {
break;
}
input_array[cursor] = c;
pos++;
}
}
Notice how there's no memmove (or zeroing, FWIW) necessary here? We simply operate as though the start of the array is at cursor, the end at cursor - 1 and we wrap by modulo needle_size to ensure there's no overflow/underflow. Then, after each insertion we simply increment the cursor...
I am having a hard time understanding and parsing the info data present in a bitmap image. To better understand I read the following tutorial, Raster Data.
Now, the code presented there is as follows (greyscale, 8-bit colour values):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
/*-------STRUCTURES---------*/
typedef struct {int rows; int cols; unsigned char* data;} sImage;
/*-------PROTOTYPES---------*/
long getImageInfo(FILE*, long, int);
int main(int argc, char* argv[])
{
FILE *bmpInput, *rasterOutput;
sImage originalImage;
unsigned char someChar;
unsigned char* pChar;
int nColors; /* BMP number of colors */
long fileSize; /* BMP file size */
int vectorSize; /* BMP vector size */
int r, c; /* r = rows, c = cols */
/* initialize pointer */
someChar = '0';
pChar = &someChar;
if(argc < 2)
{
printf("Usage: %s bmpInput.bmp\n", argv[0]);
//end the execution
exit(0);
}
printf("Reading filename %s\n", argv[1]);
/*--------READ INPUT FILE------------*/
bmpInput = fopen(argv[1], "rb");
//fseek(bmpInput, 0L, SEEK_END);
/*--------DECLARE OUTPUT TEXT FILE--------*/
rasterOutput = fopen("data.txt", "w");
/*--------GET BMP DATA---------------*/
originalImage.cols = (int)getImageInfo(bmpInput, 18, 4);
originalImage.rows = (int)getImageInfo(bmpInput, 22, 4);
fileSize = getImageInfo(bmpInput, 2, 4);
nColors = getImageInfo(bmpInput, 46, 4);
vectorSize = fileSize - (14 + 40 + 4*nColors);
/*-------PRINT DATA TO SCREEN-------------*/
printf("Width: %d\n", originalImage.cols);
printf("Height: %d\n", originalImage.rows);
printf("File size: %ld\n", fileSize);
printf("# Colors: %d\n", nColors);
printf("Vector size: %d\n", vectorSize);
/*----START AT BEGINNING OF RASTER DATA-----*/
fseek(bmpInput, (54 + 4*nColors), SEEK_SET);
/*----------READ RASTER DATA----------*/
for(r=0; r<=originalImage.rows - 1; r++)
{
for(c=0; c<=originalImage.cols - 1; c++)
{
/*-----read data and print in (row,column) form----*/
fread(pChar, sizeof(char), 1, bmpInput);
fprintf(rasterOutput, "(%d, %d) = %d\n", r, c, *pChar);
}
}
fclose(bmpInput);
fclose(rasterOutput);
}
/*----------GET IMAGE INFO SUBPROGRAM--------------*/
long getImageInfo(FILE* inputFile, long offset, int numberOfChars)
{
unsigned char *ptrC;
long value = 0L;
unsigned char dummy;
int i;
dummy = '0';
ptrC = &dummy;
fseek(inputFile, offset, SEEK_SET);
for(i=1; i<=numberOfChars; i++)
{
fread(ptrC, sizeof(char), 1, inputFile);
/* calculate value based on adding bytes */
value = (long)(value + (*ptrC)*(pow(256, (i-1))));
}
return(value);
} /* end of getImageInfo */
What I am not understanding:-
I am unable to understand the 'GET IMAGE INFO SUBPROGRAM' part, where the code is trying to get image info such as the number of rows, columns, etc. Why is this info stored over 4 bytes, and what is the purpose of the value = (long)(value + (*ptrC)*(pow(256, (i-1)))); instruction?
Why is unsigned char dummy = '0' created and then ptrC = &dummy assigned?
Why can't we get the number of rows in an image by reading just 1 byte of data, the way we get the greyscale value at a particular row and column?
Why are we using unsigned char to store the byte? Isn't there some other data type, like int or long, that we could use effectively here?
Please help me clear up these doubts (confusions!!?), and forgive me if they sound noobish.
Thank you.
I would say the tutorial is quite bad in some ways, and your problems understanding it are not always due to your being a beginner.
I am unable to understand the 'GET IMAGE INFO SUBPROGRAM' part, where the code is trying to get image info such as the number of rows, columns, etc. Why is this info stored over 4 bytes, and what is the purpose of the value = (long)(value + (*ptrC)*(pow(256, (i-1)))); instruction?
The reason to store over 4 bytes is to allow the image to be sized between 0 and 2^32-1 high and wide. If we used just one byte, we could only have images sized 0..255 and with 2 bytes 0..65535.
The strange value = (long)(value + (*ptrC)*(pow(256, (i-1)))); is something I've never seen before. It's used to assemble the bytes into a long in a way that works regardless of the host's endianness. The idea is to use powers of 256: the first byte is multiplied by 1, the next by 256, the next by 65536, and so on.
A much more readable way would be to use shifts, e.g. value = value + ((long)(*ptrC) << 8*(i-1));. Better still would be to read the bytes from the highest to the lowest and use value = (value << 8) + *ptrC; (note the parentheses: << binds more loosely than +). In my eyes that is a lot better, but when the bytes arrive in a different order it is not always so simple.
A simple rewrite that is much easier to understand would be
long getImageInfo(FILE* inputFile, long offset, int numberOfChars)
{
unsigned char ptrC;
long value = 0L;
int i;
fseek(inputFile, offset, SEEK_SET);
for(i=0; i<numberOfChars; i++) // Start with zero to make the code simpler
{
fread(&ptrC, 1, 1, inputFile); // sizeof(char) is always 1, no need to use it
value = value + ((long)ptrC << 8*i); // Shifts are a lot simpler to look at and understand what's the meaning
}
return value; // Parentheses would make it look like a function
}
Why is unsigned char dummy = '0' created and then ptrC = &dummy assigned?
This is also pointless. They could've just declared unsigned char ptrC; and then used &ptrC where they now use ptrC, and ptrC where they now use *ptrC. That would also have made it clear that it is just an ordinary local variable.
Why can't we get the number of rows in an image by reading just 1 byte of data, the way we get the greyscale value at a particular row and column?
What if the image is 3475 rows high? One byte isn't enough. So it needs more bytes. The way of reading is just a bit complicated.
Why are we using unsigned char to store the byte? Isn't there some other data type, like int or long, that we could use effectively here?
Unsigned char is exactly one byte long. Why would we use any other type for storing a byte then?
(4) The data of binary files is made up of bytes, which in C are represented by unsigned char. Because that's a long word to type, it is sometimes typedeffed to byte or uchar. A good standard-compliant way to define bytes is to use uint8_t from <stdint.h>.
(3) I'm not quite sure what you're trying to get at, but the first bytes of a BMP file - usually 54, though there are other BMP format variants - make up the header, which contains information on the colour depth, width and height of the image. The bytes after byte 54 store the raw data. I haven't tested your code, but there might be an issue with padding, because the raw data for each row must be padded to a size that is divisible by 4.
(2) There isn't really a point in defining an extra pointer here. You could just as well fread(&dummy, ...) directly.
(1) Ugh. This function reads a multi-byte value from the file at position offset in the file. The file is made up of bytes, but several bytes can form other data types. For example, a 4-byte unsigned word is made up of:
uint8_t raw[4];
uint32_t x;
x = raw[0] + raw[1]*256 + raw[2]*256*256 + raw[3]*256*256*256;
on a PC, which uses Little Endian data.
That example also shows where the pow(256, i) comes in. Using the pow function here is not a good idea, because it is meant to be used with floating-point numbers. Even the multiplication by 256 is not very idiomatic. Usually, we construct values by byte shifting, where a multiplication by 2 is a left-shift by 1 and hence a multiplication by 256 is a left-shift by 8. Similarly, the additions above add non-overlapping ranges and are usually represented as a bitwise OR, |:
x = raw[0] | (raw[1]<<8) | (raw[2]<<16) | (raw[3]<<24);
The function accesses the file by re-positioning the file pointer (and leaving it at the new position). That's not very effective. It would be better to read the header as a 54-byte array and access the array directly.
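A sketch of that idea: slurp the 54-byte header once and decode fields from the array. le32() and the hard-coded filename are mine; only the width and height fields at offsets 18 and 22 are decoded here:
#include <stdint.h>
#include <stdio.h>

/* Decode a little-endian 32-bit value from 4 bytes. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

int main(void)
{
    uint8_t header[54];
    FILE *f = fopen("input.bmp", "rb");   /* filename assumed */
    if (!f || fread(header, 1, sizeof header, f) != sizeof header)
        return 1;
    printf("width %u, height %u\n", le32(header + 18), le32(header + 22));
    fclose(f);
    return 0;
}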
The code is old and clumsy. Seeing something like:
for(r=0; r<=originalImage.rows - 1; r++)
is already enough for me not to trust it. I'm sure you can find a better example of reading greyscale images from BMP. You could even write your own and start with the Wikipedia article on the BMP format.
The latest update: my classmate uses fread() to read about one third of the whole file into a string at a time, which avoids running out of memory. He then processes that string, separating it into his data structure. Note that you need to handle one problem: the last several characters at the end of the string may not form a whole number. Think of a way to detect this, so you can join those characters with the first several characters of the next string.
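Here is a minimal sketch of that boundary handling, under my own assumptions (an arbitrary CHUNK size, whitespace-separated integers, and a placeholder where the real parsing goes):
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define CHUNK (1 << 20)   /* illustrative chunk size */

void scan_file(FILE *f)
{
    static char buf[CHUNK + 1];
    size_t tail = 0;                        /* bytes carried over */
    for (;;) {
        size_t got = fread(buf + tail, 1, CHUNK - tail, f);
        size_t len = tail + got;
        if (len == 0)
            break;                          /* nothing left at all */
        size_t end = len;
        if (got > 0)                        /* more data may follow, so the
                                               last token might be cut off */
            while (end > 0 && isdigit((unsigned char)buf[end - 1]))
                end--;
        char saved = buf[end];
        buf[end] = '\0';
        /* ... parse the complete numbers in buf[0 .. end) here ... */
        buf[end] = saved;
        if (got == 0)
            break;                          /* EOF: final partial token done */
        tail = len - end;
        memmove(buf, buf + end, tail);      /* carry the partial token over */
    }
}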
Each number corresponds to a different field in your data structure, and the data structure should be very simple, because most of the time is spent inserting data into it. Therefore, the fastest way to process the data is: use fread() to read the file into a string, then separate that string into one-dimensional arrays.
For example (just an example, not from my project), I have a text file like:
72 24 20
22 14 30
23 35 40
42 29 50
19 22 60
18 64 70
.
.
.
Each row is one person's information: the first column is the person's age, the second is his deposit, and the third is his wife's age.
Then we use fread() to read this text file into a string, and I use strtok() to separate it (you can use a faster way to separate it).
Don't use a data structure to store the separated data!
I mean, don't do this:
struct person
{
int age;
int deposit;
int wife_age;
};
struct person *my_data_store;
my_data_store=malloc(sizeof(struct person)*length_of_this_array);
//then insert separated data into my_data_store
Don't use a data structure to store the data!
The fastest way to store your data is like this:
int *age;
int *deposit;
int *wife_age;
age=(int*)malloc(sizeof(int)*age_array_length);
deposit=(int*)malloc(sizeof(int)*deposit_array_length);
wife_age=(int*)malloc(sizeof(int)*wife_array_length);
// age_array_length, deposit_array_length and wife_array_length can be found with `wc -l`; you can run wc -l from inside your C program
// then you can insert the separated data into these arrays as you use strtok() to separate it.
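As an aside, here is one hedged way to get that `wc -l` count from inside a C program, using POSIX popen(); count_lines() is my own illustrative helper:
#include <stdio.h>

/* Return the line count of the file at path, or -1 on failure. */
long count_lines(const char *path)
{
    char cmd[512];
    long n = -1;
    snprintf(cmd, sizeof cmd, "wc -l < %s", path);
    FILE *p = popen(cmd, "r");
    if (p) {
        if (fscanf(p, "%ld", &n) != 1)
            n = -1;
        pclose(p);
    }
    return n;
}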
The second update: the best way is to use fread() to read part of the file into a string, then separate that string into your data structure. By the way, don't use any standard library function that converts a string into an integer; that's too slow - fscanf() or atoi(), for example. We should write our own function to convert a string into an integer. Not only that, we should design a simpler data structure to store the data. By the way, my classmate can read this 1.7 GB file within 7 seconds. There is a way to do this, and it is much better than using multithreading. I haven't seen his code; after I see it, I will update a third time to explain how he did it. That will be two months from now, after our course has finished.
Update: I used multithreading to solve this problem!! It works! Notice: don't use clock() to measure time when using multiple threads; that's why I thought the execution time had increased.
One thing I want to clarify is that the time to scan the file without storing the values into my structure is about 20 seconds, while storing the values into my structure takes about 60 seconds. The definition of "time of reading the file" includes both: time of reading the file = scanning the file + storing the values into my structure. Therefore, does anyone have suggestions for storing the values faster? (By the way, I don't have control over the input file; it is generated by our professor. I am trying to use multithreading to solve this problem; if it works, I will tell you the result.)
I have a file, its size is 1.7G.
It looks like:
1 1427826
1 1427827
1 1750238
1 2
2 3
2 4
3 5
3 6
10 7
11 794106
.
.
and so on.
The file has about ten million lines. Now I need to read it and store the numbers in my data structure within 15 seconds.
I have tried using fread() to read the whole file and then strtok() to separate each number, but it still needs 80 seconds. If I use fscanf(), it is even slower. How do I speed it up? Maybe we cannot get it under 15 seconds, but 80 seconds is too long. How can I read it as fast as possible?
Here is part of my reading code:
int Read_File(FILE *fd,int round)
{
clock_t start_read = clock();
int first,second;
first=0;
second=0;
fseek(fd,0,SEEK_END);
long int fileSize=ftell(fd);
fseek(fd,0,SEEK_SET);
char * buffer=(char *)malloc(sizeof(char)*(fileSize+1));
char *string_first;
long int newFileSize=fread(buffer,1,fileSize,fd);
buffer[newFileSize]='\0'; /* terminate the buffer so strtok() stops at the end */
char *string_second;
string_first=strtok(buffer," \t\n"); /* grab the first token before the loop */
while(string_first!=NULL)
{
first=atoi(string_first);
string_second=strtok(NULL," \t\n");
second=atoi(string_second);
string_first=strtok(NULL," \t\n");
max_num= first > max_num ? first : max_num ;
max_num= second > max_num ? second : max_num ;
root_level=first/NUM_OF_EACH_LEVEL;
leaf_addr=first%NUM_OF_EACH_LEVEL;
if(root_addr[root_level][leaf_addr].node_value!=first)
{
root_addr[root_level][leaf_addr].node_value=first;
root_addr[root_level][leaf_addr].head=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].tail=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].g_credit[0]=1;
root_addr[root_level][leaf_addr].head->neighbor_value=second;
root_addr[root_level][leaf_addr].head->next=NULL;
root_addr[root_level][leaf_addr].tail=root_addr[root_level][leaf_addr].head;
root_addr[root_level][leaf_addr].degree=1;
}
else
{
//insert its new neighbor
Neighbor *newNeighbor;
newNeighbor=(Neighbor*)malloc(sizeof(Neighbor));
newNeighbor->neighbor_value=second;
root_addr[root_level][leaf_addr].tail->next=newNeighbor;
root_addr[root_level][leaf_addr].tail=newNeighbor;
root_addr[root_level][leaf_addr].degree++;
}
root_level=second/NUM_OF_EACH_LEVEL;
leaf_addr=second%NUM_OF_EACH_LEVEL;
if(root_addr[root_level][leaf_addr].node_value!=second)
{
root_addr[root_level][leaf_addr].node_value=second;
root_addr[root_level][leaf_addr].head=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].tail=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].head->neighbor_value=first;
root_addr[root_level][leaf_addr].head->next=NULL;
root_addr[root_level][leaf_addr].tail=root_addr[root_level][leaf_addr].head;
root_addr[root_level][leaf_addr].degree=1;
root_addr[root_level][leaf_addr].g_credit[0]=1;
}
else
{
//insert its new neighbor
Neighbor *newNeighbor;
newNeighbor=(Neighbor*)malloc(sizeof(Neighbor));
newNeighbor->neighbor_value=first;
root_addr[root_level][leaf_addr].tail->next=newNeighbor;
root_addr[root_level][leaf_addr].tail=newNeighbor;
root_addr[root_level][leaf_addr].degree++;
}
}
return 0;
}
Some suggestions:
a) Consider converting (or pre-processing) the file into a binary format; with the aim to minimise the file size and also drastically reduce the cost of parsing. I don't know the ranges for your values, but various techniques (e.g. using one bit to tell if the number is small or large and storing the number as either a 7-bit integer or a 31-bit integer) could halve the file IO (and double the speed of reading the file from disk) and slash parsing costs down to almost nothing. Note: For maximum effect you'd modify whatever software created the file in the first place.
b) Reading the entire file into memory before you parse it is a mistake. It doubles the amount of RAM required (and the cost of allocating/freeing) and has disadvantages for CPU caches. Instead read a small amount of the file (e.g. 16 KiB) and process it, then read the next piece and process it, and so on; so that you're constantly reusing the same small buffer memory.
c) Use parallelism for file IO. It shouldn't be hard to read the next piece of the file while you're processing the previous piece of the file (either by using 2 threads or by using asynchronous IO).
d) Pre-allocate memory for the "neighbour" structures and remove most/all malloc() calls from your loop. The best possible case is to use a statically allocated array as a pool - e.g. Neighbor myPool[MAX_NEIGHBORS]; where malloc() can be replaced with &myPool[nextEntry++]; (see the sketch after this list). This reduces/removes the overhead of malloc() while also improving cache locality for the data itself.
e) Use parallelism for storing values. For example, you could have multiple threads where the first thread handles all the cases where root_level % NUM_THREADS == 0, the second thread handles all cases where root_level % NUM_THREADS == 1, etc.
With all of the above (assuming a modern 4-core CPU), I think you can get the total time (for reading and storing) down to less than 15 seconds.
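To make suggestion (d) concrete, here is a minimal sketch; the Neighbor definition is assumed to match the struct implied by the question (a value plus a next pointer), and MAX_NEIGHBORS would be sized from the line count:
#include <stddef.h>

/* Assumed shape of the question's Neighbor node. */
typedef struct Neighbor {
    int neighbor_value;
    struct Neighbor *next;
} Neighbor;

#define MAX_NEIGHBORS 25000000   /* size this from `wc -l` */

static Neighbor myPool[MAX_NEIGHBORS];
static size_t nextEntry = 0;

/* Replaces malloc(sizeof(Neighbor)): no per-node overhead, no free() needed. */
static Neighbor *new_neighbor(void)
{
    return (nextEntry < MAX_NEIGHBORS) ? &myPool[nextEntry++] : NULL;
}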
My suggestion would be to form a processing pipeline and thread it. Reading the file is an I/O bound task and parsing it is CPU bound. They can be done at the same time in parallel.
There are several possibilities. You'll have to experiment.
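For instance, here is a minimal double-buffered pipeline sketch using POSIX threads; the slot structure, buffer size, and file name are all illustrative assumptions rather than anything from the question:
#include <pthread.h>
#include <stdio.h>

#define BUF_SIZE (1 << 20)

struct slot {
    char data[BUF_SIZE];
    size_t len;                    /* valid bytes; 0 signals EOF */
    int full;                      /* 1 while the parser owns the data */
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

static struct slot slots[2] = {
    { .lock = PTHREAD_MUTEX_INITIALIZER, .cond = PTHREAD_COND_INITIALIZER },
    { .lock = PTHREAD_MUTEX_INITIALIZER, .cond = PTHREAD_COND_INITIALIZER },
};
static FILE *input;

static void *reader(void *arg)
{
    (void)arg;
    for (int i = 0; ; i = !i) {            /* ping-pong between the slots */
        struct slot *s = &slots[i];
        pthread_mutex_lock(&s->lock);
        while (s->full)                    /* wait for the parser to drain */
            pthread_cond_wait(&s->cond, &s->lock);
        s->len = fread(s->data, 1, BUF_SIZE, input);
        s->full = 1;
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
        if (s->len == 0)
            return NULL;                   /* EOF propagated via len == 0 */
    }
}

int main(void)
{
    input = fopen("input.txt", "r");       /* filename assumed */
    if (!input)
        return 1;
    pthread_t tid;
    pthread_create(&tid, NULL, reader, NULL);
    for (int i = 0; ; i = !i) {            /* parser side */
        struct slot *s = &slots[i];
        pthread_mutex_lock(&s->lock);
        while (!s->full)
            pthread_cond_wait(&s->cond, &s->lock);
        size_t len = s->len;
        /* ... parse s->data[0 .. len) here; a real version might hand the
         * buffer off so the parsing work happens outside the lock ... */
        s->full = 0;
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
        if (len == 0)
            break;                         /* reader reached EOF */
    }
    pthread_join(tid, NULL);
    fclose(input);
    return 0;
}
Compile with -pthread.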
Exploit what your OS gives you. On Windows, check out overlapped I/O. This lets your computation proceed with parsing one buffer full of data while the Windows kernel fills another; then you switch buffers and continue. This is related to what #Neal suggested, but has less buffering overhead: Windows deposits data directly into your buffer through the DMA channel, with no copying. On Linux, check out memory-mapped files. There the OS uses the virtual memory hardware to do more or less what Windows does with overlapped I/O.
Code your own integer conversion. This is likely to be a bit faster than making a clib call per integer.
Here's example code. You want to absolutely limit the number of comparisons.
// Process one input buffer.
*end_buf = ' '; // add a sentinel at the end of the buffer
for (char *p = buf; p < end_buf; p++) {
// somewhat unsafe (but fast) reliance on unsigned wrapping
unsigned val = *p - '0';
if (val <= 9) {
// Found start of integer.
for (;;) {
unsigned digit_val = *p - '0';
if (digit_val > 9) break;
val = 10 * val + digit_val;
p++;
}
... do something with val
}
}
Don't call malloc once per record. You should allocate blocks of many structs at a time.
Experiment with buffer sizes.
Crank up compiler optimizations. This is the kind of code that benefits greatly from excellent code generation.
Yes, standard library conversion functions are surprisingly slow.
If portability is not a problem, I'd memory-map the file. Then, something like the following C99 code (untested) could be used to parse the entire memory map:
#include <stdlib.h>
#include <errno.h>
struct pair {
unsigned long key;
unsigned long value;
};
typedef struct {
size_t size; /* Maximum number of items */
size_t used; /* Number of items used */
struct pair item[];
} items;
/* Initial number of items to allocate for */
#ifndef ITEM_ALLOC_SIZE
#define ITEM_ALLOC_SIZE 8388608
#endif
/* Adjustment to new size (parameter is old number of items) */
#ifndef ITEM_REALLOC_SIZE
#define ITEM_REALLOC_SIZE(from) (((from) | 1048575) + 1048577)
#endif
items *parse_items(const void *const data, const size_t length)
{
const unsigned char *ptr = (const unsigned char *)data;
const unsigned char *const end = (const unsigned char *)data + length;
items *result;
size_t size = ITEM_ALLOC_SIZE;
size_t used = 0;
unsigned long val1, val2;
result = malloc(sizeof (items) + size * sizeof (struct pair));
if (!result) {
errno = ENOMEM;
return NULL;
}
while (ptr < end) {
/* Skip newlines and whitespace. */
while (ptr < end && (*ptr == '\0' || *ptr == '\t' ||
*ptr == '\n' || *ptr == '\v' ||
*ptr == '\f' || *ptr == '\r' ||
*ptr == ' '))
ptr++;
/* End of data? */
if (ptr >= end)
break;
/* Parse first number. */
if (*ptr >= '0' && *ptr <= '9')
val1 = *(ptr++) - '0';
else {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
while (ptr < end && *ptr >= '0' && *ptr <= '9') {
const unsigned long old = val1;
val1 = 10UL * val1 + (*(ptr++) - '0');
if (val1 < old) {
free(result);
errno = EDOM; /* Overflow! */
return NULL;
}
}
/* Skip whitespace. */
while (ptr < end && (*ptr == '\t' || *ptr == '\v' ||
*ptr == '\f' || *ptr == ' '))
ptr++;
if (ptr >= end) {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
/* Parse second number. */
if (*ptr >= '0' && *ptr <= '9')
val2 = *(ptr++) - '0';
else {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
while (ptr < end && *ptr >= '0' && *ptr <= '9') {
const unsigned long old = val2;
val2 = 10UL * val2 + (*(ptr++) - '0');
if (val2 < old) {
free(result);
errno = EDOM; /* Overflow! */
return NULL;
}
}
if (ptr < end) {
/* Error unless whitespace or newline. */
if (*ptr != '\0' && *ptr != '\t' && *ptr != '\n' &&
*ptr != '\v' && *ptr != '\f' && *ptr != '\r' &&
*ptr != ' ') {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
/* Skip the rest of this line. */
while (ptr < end && *ptr != '\n' && *ptr != '\r')
ptr++;
}
/* Need to grow result? */
if (used >= size) {
items *const old = result;
size = ITEM_REALLOC_SIZE(used);
result = realloc(result, sizeof (items) + size * sizeof (struct pair));
if (!result) {
free(old);
errno = ENOMEM;
return NULL;
}
}
result->item[used].key = val1;
result->item[used].value = val2;
used++;
}
/* Note: we could reallocate result here,
* if memory use is an issue.
*/
result->size = size;
result->used = used;
errno = 0;
return result;
}
I've used a similar approach to load molecular data for visualization. Such data contains floating-point values, but precision is typically only about seven significant digits, no multiprecision math needed. A custom routine to parse such data beats the standard functions by at least an order of magnitude in speed.
At least the Linux kernel is pretty good at observing memory/file access patterns; using madvise() also helps.
If you cannot use a memory map, then the parsing function would be a bit different: it would append to an existing result, and if the final line in the buffer is partial, it would indicate so (and the number of chars not parsed), so that the caller can memmove() the buffer, read more data, and continue parsing. (Use 16-byte aligned addresses for reading new data, to maximize copy speeds. You don't necessarily need to move the unread data to the exact beginning of the buffer, you see; just keep the current position in the buffered data.)
Questions?
First, what's your disk hardware? A single SATA drive is likely to be topped out at 100 MB/sec. And probably more like 50-70 MB/sec. If you're already moving data off the drive(s) as fast as you can, all the software tuning you do is going to be wasted.
Can your hardware support reading faster? If so: your read pattern - reading the whole file into memory once - is the perfect use-case for direct IO. Open your file using open( "/file/name", O_RDONLY | O_DIRECT );. Read into page-aligned buffers (see the man page for valloc()) in page-sized chunks. Using direct IO will cause your data to bypass double buffering in the kernel page cache, which is useless when you're reading that much data that fast and not re-reading the same pages over and over.
If you're running on a true high-performance file system, you can read asynchronously, and likely faster, with lio_listio() or aio_read(). Or you can just use multiple threads to read - using pread() so you don't have to waste time seeking, and because when you read with multiple threads, a seek on an open file affects all threads reading from it.
And do not try to read fast into a newly-malloc'd chunk of memory - memset() it first. Because truly fast disk systems can pump data into the CPU faster than the virtual memory manager can create virtual pages for a process.
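Putting the O_DIRECT and memset() advice together, a Linux sketch might look like this (the path is the placeholder from above; posix_memalign stands in for valloc; the chunk size is arbitrary):
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t bufsize = 1024 * (size_t)page;  /* must stay block-aligned for O_DIRECT */
    void *buf = NULL;
    int fd = open("/file/name", O_RDONLY | O_DIRECT);
    if (fd < 0 || posix_memalign(&buf, (size_t)page, bufsize) != 0)
        return 1;
    memset(buf, 0, bufsize);               /* pre-fault the pages, as advised above */
    for (;;) {
        ssize_t got = read(fd, buf, bufsize);
        if (got <= 0)
            break;                         /* EOF or error */
        /* ... hand buf to the parser / copy the data into place ... */
    }
    free(buf);
    close(fd);
    return 0;
}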
I've got the following code to read and process multiple very big files one after another.
for(j = 0; j < CORES; ++j) {
double time = omp_get_wtime();
printf("File: %d, time: %f\n", j, time);
char in[256];
sprintf(in, "%s.%d", FIN, j);
FILE* f = fopen(in, "r");
if (f == NULL)
fprintf(stderr, "open failed: %s\n", FIN);
int i;
char buffer[1024];
char* tweet;
int takeTime = 1;
for (i = 0, tweet = TWEETS + (size_t)j*(size_t)TNUM*(size_t)TSIZE; i < TNUM; i++, tweet += TSIZE) {
double start;
double end;
if(takeTime) {
start = omp_get_wtime();
takeTime = 0;
}
char* line = fgets(buffer, 1024, f);
if (line == NULL) {
fprintf(stderr, "error reading line %d\n", i);
exit(2);
}
int fn = readNumber(&line);
int ln = readNumber(&line);
int month = readMonth(&line);
int day = readNumber(&line);
int hits = countHits(line, key);
writeTweet(tweet, fn, ln, hits, month, day, line);
if(i%1000000 == 0) {
end = omp_get_wtime();
printf("Line: %d, Time: %f\n", i, end-start);
takeTime = 1;
}
}
fclose(f);
}
Every file contains 24000000 tweets and I read 8 files in total, one after another.
Each line (1 tweet) gets processed, and writeTweet() copies a modified line into one really big char array.
As you can see, I measure the time to see how long it takes to read and process 1 million tweets. For the first file it's about 0.5 seconds per million, which is fast enough. But with every additional file it takes longer and longer: file 2 takes about 1 second per million lines (though not every time, just on some iterations), up to 8 seconds on file number 8. Is this to be expected? Can I speed things up? All files are more or less the same, each with 24 million lines.
Edit:
Additional information: every file needs about 730 MB of RAM in processed form, so with 8 files we end up needing about 6 GB of memory.
As requested, the content of writeTweet():
void writeTweet(char* tweet, const int fn, const int ln, const int hits, const int month, const int day, char* line) {
short* ptr1 = (short*) tweet;
*ptr1 = (short) fn;
int* ptr2 = (int*) (tweet + 2);
*ptr2 = ln;
*(tweet + 6) = (char) hits;
*(tweet + 7) = (char) month;
*(tweet + 8) = (char) day;
int i;
int n = TSIZE - 9;
for (i = strlen(line); i < n; i++)
line[i] = ' '; // padding
memcpy(tweet + 9, line, n);
}
Probably writeTweet() is the bottleneck. If you copy all processed tweets into memory, a huge data array that the operating system has to manage builds up over time. If you don't have enough memory, or other processes on the system are actively using it, the OS will (in most cases) page part of the data out to disk, which increases the access time to the array. There are also mechanisms in the OS, more hidden from the user's eyes, that can affect performance.
You shouldn't store all the processed lines in memory. The simplest way out: dump the processed tweets to disk (write a file). However, the right solution depends on how you use the processed tweets afterwards. If you don't access the data in the array sequentially, it is worth thinking about a special data structure for storage (B-trees?). There are already many libraries for this purpose - better to look for them.
UPD:
Modern OSs (Linux included) use a virtual memory model. To maintain this model, the kernel has a special memory manager that creates special structures of references to the real pages in memory. Usually these are maps; for large memory volumes they reference sub-maps, so it is a rather big branched structure.
When working with a big piece of memory, it is often necessary to address pages of memory randomly. To speed up addressing, the OS uses a special cache. I don't know all the subtleties of this process, but I think that in this case the cache must often be invalidated, because there is no room to store all the references at the same time. This expensive operation reduces performance, and the more memory is used, the worse it gets.
If you need to sort the large tweets array, you don't have to store everything in memory: there are ways to sort data on disk. And if you want to sort the data in memory, it isn't necessary to do real swap operations on the array elements. It's better to use an intermediate structure with references to the elements of the tweets array, and to sort the references instead of the data.
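As a sketch of that last idea: build an array of pointers into the big tweet array and sort the pointers, so the fixed-size records never move. TSIZE here and the memcmp key are stand-ins for the question's actual record layout:
#include <stdlib.h>
#include <string.h>

#define TSIZE 64   /* assumed record size; the question's TSIZE */

/* qsort comparator over pointers into the tweet array; compares whole
 * records - substitute the real key comparison here. */
static int cmp_tweet(const void *a, const void *b)
{
    const char *x = *(const char *const *)a;
    const char *y = *(const char *const *)b;
    return memcmp(x, y, TSIZE);
}

/* Build an index of pointers and sort the index; the TSIZE-byte records
 * themselves never move. */
void sort_by_reference(char *tweets, size_t count, char **index)
{
    for (size_t i = 0; i < count; i++)
        index[i] = tweets + i * TSIZE;
    qsort(index, count, sizeof *index, cmp_tweet);
}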