Sequential loading of files gets much slower with each subsequent file

I've got the following code to read and process multiple very big files one after another.
for(j = 0; j < CORES; ++j) {
double time = omp_get_wtime();
printf("File: %d, time: %f\n", j, time);
char in[256];
sprintf(in, "%s.%d", FIN, j);
FILE* f = fopen(in, "r");
if (f == NULL) {
fprintf(stderr, "open failed: %s\n", in);
exit(1);
}
int i;
char buffer[1024];
char* tweet;
int takeTime = 1;
for (i = 0, tweet = TWEETS + (size_t)j*(size_t)TNUM*(size_t)TSIZE; i < TNUM; i++, tweet += TSIZE) {
double start;
double end;
if(takeTime) {
start = omp_get_wtime();
takeTime = 0;
}
char* line = fgets(buffer, 1024, f);
if (line == NULL) {
fprintf(stderr, "error reading line %d\n", i);
exit(2);
}
int fn = readNumber(&line);
int ln = readNumber(&line);
int month = readMonth(&line);
int day = readNumber(&line);
int hits = countHits(line, key);
writeTweet(tweet, fn, ln, hits, month, day, line);
if(i%1000000 == 0) {
end = omp_get_wtime();
printf("Line: %d, Time: %f\n", i, end-start);
takeTime = 1;
}
}
fclose(f);
}
Every file contains 24000000 tweets and I read 8 files in total, one after another.
Each line (1 tweet) gets processed and writeTweet() copies a modified version of the line into one really big char array.
As you can see, I measure the time to see how long it takes to read and process 1 million tweets. For the first file, it's about 0.5 seconds per million, which is fast enough. But after every additional file, it takes longer and longer: file 2 takes about 1 second per million lines (not every time, just in some iterations), up to 8 seconds per million on file number 8. Is this to be expected? Can I speed things up? All the files are more or less identical, always with 24 million lines.
Edit:
Additional information: each file needs about 730 MB of RAM in processed form, so with 8 files the total memory requirement comes to about 6 GB.
As requested, here is the content of writeTweet():
void writeTweet(char* tweet, const int fn, const int ln, const int hits, const int month, const int day, char* line) {
short* ptr1 = (short*) tweet;
*ptr1 = (short) fn;
int* ptr2 = (int*) (tweet + 2);
*ptr2 = ln;
*(tweet + 6) = (char) hits;
*(tweet + 7) = (char) month;
*(tweet + 8) = (char) day;
int i;
int n = TSIZE - 9;
for (i = strlen(line); i < n; i++)
line[i] = ' '; // padding
memcpy(tweet + 9, line, n);
}

writeTweet() is probably the bottleneck. Because you copy all processed tweets into memory, a huge data array that the operating system has to manage builds up over time. If there is not enough memory, or other processes in the system actively use it, the OS will (in most cases) page part of that data out to disk, which increases the time it takes to access the array. There are also mechanisms in the OS, hidden from the user's eyes, that can affect performance here.
You shouldn't store all processed lines in memory. The simplest way is to dump the processed tweets to disk (write them to a file). The right solution, however, depends on how you use the processed tweets afterwards. If you don't access the data in the array sequentially, it is worth thinking about a dedicated storage data structure (B-trees?). There are already many libraries for this purpose -- better to look for one of them.
UPD:
Modern OSes (including Linux) use a virtual memory model. To maintain this model, the kernel has a memory manager that builds structures of references to the real pages in memory. Usually these are page tables; for large memory volumes they reference sub-tables, forming a rather large branched structure.
When working with a big piece of memory you often have to access arbitrary pages of it at random. To speed up address translation the hardware uses a special cache. I don't know all the subtleties of this process, but I think that in this case the cache has to be invalidated often, because there is not enough room to keep all the references at the same time. That is an expensive operation, and it reduces performance more and more as the amount of memory in use grows.
If you need to sort the large tweets array, you don't have to keep everything in memory: there are ways to sort data on disk. If you do want to sort it in memory, you don't need to do the actual swap operations on the array elements. It is better to use an intermediate structure with references to the elements of the tweets array, and to sort the references instead of the data.
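A minimal sketch of that "sort references instead of data" idea, assuming the fixed-size TSIZE-byte records produced by writeTweet(); sorting by the leading 2-byte fn field is an arbitrary example key, not something the question specifies:
#include <stdlib.h>
#include <string.h>

/* Sketch only: compare two records by the 2-byte field at offset 0. */
static int compare_tweets(const void *a, const void *b)
{
    short ka, kb;
    memcpy(&ka, *(const char * const *)a, sizeof ka);
    memcpy(&kb, *(const char * const *)b, sizeof kb);
    return (ka > kb) - (ka < kb);
}

/* Build and sort an index of pointers into the big tweets array, leaving the
 * TSIZE-byte records themselves untouched. */
char **build_sorted_index(char *tweets, size_t count, size_t tsize)
{
    char **index = malloc(count * sizeof *index);
    if (index == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        index[i] = tweets + i * tsize;
    qsort(index, count, sizeof *index, compare_tweets);
    return index; /* index[k] points at the k-th smallest record */
}
Sorting 8-byte pointers moves far less memory than swapping whole TSIZE-byte records, which is the point of the suggestion above.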

Related

In this case, how can I save the data more efficiently and conveniently?

I am measuring the latency of some operations.
There are many scenarios here.
The latency of each scenario is roughly distributed within a small interval. For each scenario, I need to take 500,000 measurements. Finally I want to output each latency value and the number of times it occurred.
My initial implementation was:
#define range 1000
int rec_array[range];
for (int i = 0; i < 500000; i++) {
int latency = measure_latency();
rec_array[latency]++;
}
for (int i = 0; i < range; i++) {
printf("%d %d\n", i, rec_array[i]);
}
This approach was fine at first, but as the number of scenarios grew, it became problematic.
The latency measured in each scenario is concentrated in a small interval, so most of the entries in rec_array are 0.
Since each scenario is different, the latency values are also different. Some latencies are concentrated around 500, so I need an array with a length greater than 500; others are concentrated around 5000, so I need an array with a length greater than 5000.
Because of the large number of scenarios, I end up creating too many arrays. For example, if I have ten scenarios I need to create ten rec_arrays, each with a different length.
Is there any efficient and convenient strategy? Since I am using C, templates like std::vector cannot be used.
I considered linked lists. However, the interval over which the latency values are distributed is uncertain, the number of occurrences of any particular latency is uncertain, and when the same latency occurs again its count needs to be incremented. A linked list doesn't seem very convenient for that either.
I'm sorry, I had just gone out. Thank you for your help. I have read the comments carefully. Here are some of my answers.
These data are mainly used to draw plots, for example the one shown below.
The comment area says that the data seems small. The main reason I thought about this problem is that, as the plot shows, only a few entries of each array are used and the vast majority are 0. And there are many scenarios, and I need to generate an array for each one. I have referred to an open source implementation.
According to the comments, it seems that using arrays directly is a good solution, considering the fast access. Thanks very much!
A linked list is probably (and almost always) the least efficient way to store things – both slow as hell, and memory inefficient, since your values use less storage than your pointers. Linked lists are very rarely a good solution for anything that actually stores significant data. The only reason they're so prevalent is that C still has no proper containers, and they're easy wheels to
reinvent for every single C program you write.
#define range 1000
int rec_array[range];
So you're (probably! This depends on your compiler and where you write int rec_array[range];) storing rec_array on the stack, and it's large. (Actually, 4000 bytes is not "large" by any modern computer's standards, but still.) You should not be doing that; instead, this should be heap-allocated, once, at initialization.
The solution is to allocate it:
/* SPDX-License-Identifier: LGPL-2.1+ */
/* Copyright Marcus Müller and others */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define N_RUNS 500000
/*
* Call as
* program maximum_latency
*/
unsigned int *run_benchmark(struct task_t task, unsigned int *latencies,
unsigned int *max_latency) {
for (unsigned int run = 0; run < N_RUNS; ++run) {
unsigned int latency = measure_latency();
if (latency >= *max_latency) {
latency = *max_latency - 1; /* clamp into the last bucket */
/*
* alternatively: use realloc to increase the size of `latencies`,
* and update max_latency as well; that's basically what C++ std::vector
* does
*/
}
(latencies[latency])++;
}
return latencies;
}
int main(int argc, char **argv) {
// check argument
if (argc != 2) {
exit(127);
}
int maximum_latency_raw = atoi(argv[1]);
if (maximum_latency_raw <= 0) {
exit(126);
}
unsigned int maximum_latency = maximum_latency_raw;
/*
* note that the length does no longer have to be a constant
* if you're using calloc/malloc.
*/
unsigned int *latency_counters =
(unsigned int *)calloc(maximum_latency, sizeof(unsigned int));
for (; /* benchmark task in benchmark_tasks */;) {
latency_counters = run_benchmark(task, latency_counters, &maximum_latency);
print_benchmark_result(latency_counters, maximum_latency);
// clear our counters after each run!
memset(latency_counters, 0, maximum_latency * sizeof(unsigned int));
}
}
void print_benchmark_result(unsigned int *array, unsigned int length) {
for (unsigned int index = 0; index < length; ++index) {
printf("%u %u\n", index, array[index]);
}
puts("============================\n");
}
Note especially the "alternatively: realloc" comment in the middle: realloc allows you to increase the size of your array:
unsigned int *run_benchmark(struct task_t task, unsigned int *latencies,
unsigned int *max_latency) {
for (unsigned int run = 0; run < N_RUNS; ++run) {
unsigned int latency = measure_latency();
if (latency >= *max_latency) {
// double the size!
latencies = (unsigned int *)realloc(latencies, (*max_latency) * 2 *
sizeof(unsigned int));
// realloc doesn't zero out the extension, so we need to do that
// ourselves.
memset(latencies + (*max_latency), 0, (*max_latency) * sizeof(unsigned int));
(*max_latency) *= 2;
}
(latencies[latency])++;
}
return latencies;
}
This way, your array grows when you need it to!
How about using a hash table, so we only store the latencies that actually occur? The keys in the hash table could even be ranges, with the values of those keys being the latencies (or counts) observed in that range.
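One way to read that suggestion, as a rough sketch: a small fixed-size open-addressing table keyed by the raw latency value, with a count per key. The table size, hash and probing scheme are arbitrary choices for illustration:
#include <stdio.h>

#define TABLE_SIZE 4096 /* should comfortably exceed the number of distinct latencies */

struct slot { int used; int latency; unsigned count; };
static struct slot table[TABLE_SIZE];

static void count_latency(int latency)
{
    unsigned h = (unsigned)latency % TABLE_SIZE;
    while (table[h].used && table[h].latency != latency)
        h = (h + 1) % TABLE_SIZE; /* linear probing; assumes the table never fills up */
    table[h].used = 1;
    table[h].latency = latency;
    table[h].count++;
}

static void print_counts(void)
{
    for (unsigned i = 0; i < TABLE_SIZE; i++)
        if (table[i].used)
            printf("%d %u\n", table[i].latency, table[i].count);
}
The entries come out unordered, so they would still need sorting by latency before plotting.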
Just sacrifice some precision in your latencies like 0-15, 16-31, 32-47 ... etc. Now your array will be 16x smaller.
Allocate all latency counter arrays for all scenes in one go
unsigned int *latency_div16_counter = (unsigned int *)calloc((MAX_LATENCY >> 4) * NUM_OF_SCENES, sizeof(unsigned int));
Clamp the values to the max latency, div 16 and store
for (int scene = 0; scene < NUM_OF_SCENES; scene++) {
for (int i = 0; i < 500000; i++) {
int latency = measure_latency();
if(latency >= MAX_LATENCY) latency = MAX_LATENCY - 1;
latency = latency >> 4; // int div 16
latency_div16_counter[(scene * (MAX_LATENCY >> 4)) + latency]++;
}
}
Adjust the data (mul 16) before displaying it
for (int scene = 0; scene < NUM_OF_SCENES; scene++) {
for (int i = 0; i < (MAX_LATENCY >> 4); i++) {
printf("Scene %d Latency %d Total %d\n", scene, i * 16, latency_div16_counter[(scene * (MAX_LATENCY >> 4)) + i]);
}
}

Optimizing disk IO

I have a piece of code that analyzes streams of data from very large (10-100GB) binary files. It works well, so it's time to start optimizing, and currently disk IO is the biggest bottleneck.
There are two types of files in use. The first type consists of a stream of 16-bit integers, which must be scaled after I/O to convert them into physically meaningful floating-point values. I read the file in chunks, reading one 16-bit code at a time, performing the required scaling, and then storing the result in an array. Code is below:
int64_t read_current_chimera(FILE *input, double *current,
int64_t position, int64_t length, chimera *daqsetup)
{
int64_t test;
uint16_t iv;
int64_t i;
int64_t read = 0;
if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
{
return 0;
}
for (i = 0; i < length; i++)
{
test = fread(&iv, sizeof(uint16_t), 1, input);
if (test == 1)
{
read++;
current[i] = chimera_gain(iv, daqsetup);
}
else
{
perror("End of file reached");
break;
}
}
return read;
}
The chimera_gain function just takes a 16-bit integer, scales it and returns the double for storage.
The second file type contains 64-bit doubles, but it contains two columns, of which I only need the first. To do this I fread pairs of doubles and discard the second one. The double must also be endian-swapped before use. The code I use to do this is below:
int64_t read_current_double(FILE *input, double *current, int64_t position, int64_t length)
{
int64_t test;
double iv[2];
int64_t i;
int64_t read = 0;
if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
{
return 0;
}
for (i = 0; i < length; i++)
{
test = fread(iv, sizeof(double), 2, input);
if (test == 2)
{
read++;
swapByteOrder((int64_t *)&iv[0]);
current[i] = iv[0];
}
else
{
perror("End of file reached: ");
break;
}
}
return read;
}
Can anyone suggest a method of reading these file types that would be significantly faster than what I am currently doing?
First off, it would be useful to use a profiler to identify the hot spots in your program. Based on your description of the problem, you have a lot of overhead from the sheer number of fread calls. As the files are large, there will be a big benefit to increasing the amount of data you read per I/O.
Convince yourself of this by putting together 2 small programs that read the stream.
1) Read it as you do in the example above, 2 doubles at a time.
2) Read it the same way, but 10,000 doubles at a time.
Time both runs a few times, and odds are you will observe that #2 runs much faster.
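For the 16-bit file, a rough sketch of the chunked variant might look like the following; it reuses the chimera type and chimera_gain() helper from the question, and the 16384-value block size is just an illustrative choice:
/* Sketch: read the 16-bit samples in large blocks instead of one at a time.
 * Relies on the same headers and types as the question's read_current_chimera(). */
#define CHUNK_VALUES 16384

int64_t read_current_chimera_chunked(FILE *input, double *current,
    int64_t position, int64_t length, chimera *daqsetup)
{
    uint16_t block[CHUNK_VALUES];
    int64_t done = 0;
    if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
        return 0;
    while (done < length)
    {
        size_t want = (size_t)(length - done < CHUNK_VALUES ? length - done : CHUNK_VALUES);
        size_t got = fread(block, sizeof(uint16_t), want, input);
        for (size_t i = 0; i < got; i++)
            current[done + (int64_t)i] = chimera_gain(block[i], daqsetup);
        done += (int64_t)got;
        if (got < want)
            break; /* short read: end of file or error */
    }
    return done;
}
The same pattern applies to the double-precision file: read a few thousand pairs per fread, then swap and copy the first column of each pair in a loop.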
Best of luck.

Effect of cache size on code

I want to study the effect of the cache size on code. For programs operating on large arrays, there can be a significant speed-up if the array fits in the cache.
How can I measure this?
I tried to run this C program:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define L1_CACHE_SIZE 32 // Kbytes 8192 integers
#define L2_CACHE_SIZE 256 // Kbytes 65536 integers
#define L3_CACHE_SIZE 4096 // Kbytes
#define ARRAYSIZE 32000
#define ITERATIONS 250
int arr[ARRAYSIZE];
/*************** TIME MEASSUREMENTS ***************/
double microsecs() {
struct timeval t;
if (gettimeofday(&t, NULL) < 0 )
return 0.0;
return (t.tv_usec + t.tv_sec * 1000000.0);
}
void init_array() {
int i;
for (i = 0; i < ARRAYSIZE; i++) {
arr[i] = (rand() % 100);
}
}
int operation() {
int i, j;
int sum = 0;
for (j = 0; j < ITERATIONS; j++) {
for (i = 0; i < ARRAYSIZE; i++) {
sum =+ arr[i];
}
}
return sum;
}
void main() {
init_array();
double t1 = microsecs();
int result = operation();
double t2 = microsecs();
double t = t2 - t1;
printf("CPU time %f milliseconds\n", t/1000);
printf("Result: %d\n", result);
}
taking different values of ARRAYSIZE and ITERATIONS (keeping the product, and hence the number of instructions, constant) in order to check whether the program runs faster when the array fits in the cache, but I always get the same CPU time.
Can anyone tell me what I am doing wrong?
What you really want to do is build a "memory mountain." A memory mountain helps you visualize how memory accesses affect program performance. Specifically, it measures read throughput vs spatial locality and temporal locality. Good spatial locality means that consecutive memory accesses are near each other and good temporal locality means that a certain memory location is accessed multiple times in a short amount of program time. Here is a link that briefly mentions cache performance and memory mountains. The 3rd edition of the textbook mentioned in that link is a very good reference, specifically chapter 6, for learning about memory and cache performance. (In fact, I'm currently using that section as a reference as I answer this question.)
Another link shows a test function that you could use to measure cache performance, which I have copied here:
void test(int elems, int stride)
{
int i, result = 0;
volatile int sink;
for (i = 0; i < elems; i+=stride)
result += data[i];
sink = result;
}
Stride controls the spatial locality - how far apart consecutive memory accesses are - while the working-set size controls the temporal locality.
The idea is that this function would estimate the number of cycles that it took to run. To get throughput, you'll want to take (size / stride) / (cycles / MHz), where size is the size of the array in bytes, cycles is the result of this function, and MHz is the clock speed of your processor. You'd want to call this once before you take any measurements to "warm up" your cache. Then, run the loop and take measurements.
I found a GitHub repository that you could use to build a 3D memory mountain on your own machine. I encourage you to try it on multiple machines with different processors and compare differences.
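A rough sketch of a driver that sweeps working-set size and stride in the spirit of the memory mountain; the data array, MAXELEMS and the clock_gettime() timing are all assumptions for illustration, and the numbers should be read with the usual measurement-noise caveats:
#include <stdio.h>
#include <time.h>

#define MAXELEMS (64 * 1024 * 1024 / sizeof(int)) /* 64 MB working set at most */
static int data[MAXELEMS];

/* Same access pattern as test() above; result is returned so it is not optimized away. */
static int run(long elems, int stride) {
    int result = 0;
    for (long i = 0; i < elems; i += stride)
        result += data[i];
    return result;
}

int main(void) {
    volatile int sink;
    for (long size = 16 * 1024; size <= (long)(MAXELEMS * sizeof(int)); size *= 2) {
        long elems = size / (long)sizeof(int);
        for (int stride = 1; stride <= 16; stride *= 2) {
            sink = run(elems, stride);               /* warm-up pass fills the cache */
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            sink = run(elems, stride);               /* timed pass */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            double mbps = (size / (double)stride) / (1024.0 * 1024.0) / sec;
            printf("size=%8ld B stride=%2d throughput=%8.1f MB/s\n", size, stride, mbps);
        }
    }
    (void)sink;
    return 0;
}
Plotting throughput against size and stride should show plateaus roughly corresponding to the L1, L2 and L3 sizes defined in the question.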
There's a typo in your code. =+ instead of +=.
The arr array is linked into the BSS [uninitialized] section. The default value for variables in this section is zero. All pages in this section are initially mapped R/O to a single zero page. This is Linux/Unix centric, but probably applies to most modern OSes.
So, regardless of the array size, you're only fetching from a single page, which will get cached, so that's why you get the same results.
You'll need to break the "zero page mapping" by writing something to all of arr before doing your tests. That is, do something like memset first. This will cause the OS to create a linear page mapping for arr using its COW (copy-on-write) mechanism.
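A minimal sketch of that suggestion, writing to every page of arr once before any timed loop runs; arr is the question's global array:
#include <string.h>

/* Write to every byte of arr so each page gets its own physical frame
 * (breaking the copy-on-write zero-page mapping) before timing starts. */
static void touch_array(void)
{
    memset(arr, 1, sizeof(arr));
}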

How do I read and parse a text file with numbers, fast (in C)?

Latest update: my classmate uses fread() to read about one third of the whole file into a string at a time, which avoids running out of memory. He then processes this string and separates it into his data structure. Note that you need to take care of one problem: the last few characters at the end of the string may not form a whole number. Think of a way to detect this situation so you can join those characters with the first few characters of the next string.
Each number corresponds to a different field in your data structure. Your data structure should be very simple, because inserting data into a complicated data structure is very slow; most of the time is spent on inserting the data. Therefore, the fastest way to process the data is to use fread() to read the file into a string and then separate that string into plain one-dimensional arrays.
For example (just an example, not taken from my project), I have a text file like:
72 24 20
22 14 30
23 35 40
42 29 50
19 22 60
18 64 70
.
.
.
Each row is one person's information. The first column is the person's age, the second column is his deposit, and the third is his wife's age.
Then we use fread() to read this text file into a string, and then I use strtok() to separate it (you can use a faster way to separate it).
Don't use a struct-based data structure to store the separated data!
I mean, don't do this:
struct person
{
int age;
int deposit;
int wife_age;
};
struct person *my_data_store;
my_data_store=malloc(sizeof(struct person)*length_of_this_array);
//then insert separated data into my_data_store
Again: don't use an array of structs to store the data!
The fastest way to store your data is like this:
int *age;
int *deposit;
int *wife_age;
age=(int*)malloc(sizeof(int)*age_array_length);
deposit=(int*)malloc(sizeof(int)*deposit_array_length);
wife_age=(int*)malloc(sizeof(int)*wife_array_length);
// the values of age_array_length, deposit_array_length and wife_array_length can be obtained with `wc -l`; you can invoke wc -l from your C program
// then you can insert the separated data into these arrays as you use `strtok()` to separate them.
Second update: the best way is to use fread() to read part of the file into a string, then separate that string into your data structure. By the way, don't use any standard library function that parses a string into an integer, such as fscanf() or atoi(); they are too slow. We should write our own function to convert a string into an integer. Beyond that, we should design a simpler data structure to store the data. By the way, my classmate can read this 1.7 GB file within 7 seconds, so there is a way to do this, and it is much better than using multiple threads. I haven't seen his code yet; after I see it, I will update a third time to tell you how he did it. That will be two months from now, after our course has finished.
Update: I used multithreading to solve this problem and it works! Notice: don't use clock() to measure time when using multiple threads; that's why I thought the execution time had increased.
One thing I want to clarify is that the time to read the file without storing the values into my structure is about 20 seconds, while the time spent storing the values into my structure is about 60 seconds. My definition of "time of reading the file" includes both reading the whole file and storing the values into my structure: time of reading the file = scanning the file + storing the values into my structure. So, do you have any suggestions for storing the values faster? (By the way, I don't have control over the input file; it is generated by our professor. I am trying to use multithreading to solve this problem, and if it works I will tell you the result.)
I have a file, its size is 1.7G.
It looks like:
1 1427826
1 1427827
1 1750238
1 2
2 3
2 4
3 5
3 6
10 7
11 794106
.
.
and so on.
There are about ten million lines in the file. Now I need to read this file and store these numbers in my data structure within 15 seconds.
I have tried using fread() to read the whole file and then using strtok() to separate each number, but it still needs 80 seconds. If I use fscanf(), it is even slower. How do I speed it up? Maybe we cannot get it under 15 seconds, but 80 seconds to read it is far too long. How can I read it as fast as possible?
Here is part of my reading code:
int Read_File(FILE *fd,int round)
{
clock_t start_read = clock();
int first,second;
first=0;
second=0;
fseek(fd,0,SEEK_END);
long int fileSize=ftell(fd);
fseek(fd,0,SEEK_SET);
char * buffer=(char *)malloc(sizeof(char)*(fileSize+1)); /* +1 for the terminating '\0' */
char *string_first;
long int newFileSize=fread(buffer,1,fileSize,fd);
buffer[newFileSize]='\0'; /* strtok() needs a NUL-terminated string */
char *string_second;
string_first=strtok(buffer," \t\n"); /* get the first token; missing in the original snippet */
while(string_first!=NULL)
{
first=atoi(string_first);
string_second=strtok(NULL," \t\n");
second=atoi(string_second);
string_first=strtok(NULL," \t\n");
max_num= first > max_num ? first : max_num ;
max_num= second > max_num ? second : max_num ;
root_level=first/NUM_OF_EACH_LEVEL;
leaf_addr=first%NUM_OF_EACH_LEVEL;
if(root_addr[root_level][leaf_addr].node_value!=first)
{
root_addr[root_level][leaf_addr].node_value=first;
root_addr[root_level][leaf_addr].head=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].tail=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].g_credit[0]=1;
root_addr[root_level][leaf_addr].head->neighbor_value=second;
root_addr[root_level][leaf_addr].head->next=NULL;
root_addr[root_level][leaf_addr].tail=root_addr[root_level][leaf_addr].head;
root_addr[root_level][leaf_addr].degree=1;
}
else
{
//insert its new neighbor
Neighbor *newNeighbor;
newNeighbor=(Neighbor*)malloc(sizeof(Neighbor));
newNeighbor->neighbor_value=second;
root_addr[root_level][leaf_addr].tail->next=newNeighbor;
root_addr[root_level][leaf_addr].tail=newNeighbor;
root_addr[root_level][leaf_addr].degree++;
}
root_level=second/NUM_OF_EACH_LEVEL;
leaf_addr=second%NUM_OF_EACH_LEVEL;
if(root_addr[root_level][leaf_addr].node_value!=second)
{
root_addr[root_level][leaf_addr].node_value=second;
root_addr[root_level][leaf_addr].head=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].tail=(Neighbor *)malloc(sizeof(Neighbor));
root_addr[root_level][leaf_addr].head->neighbor_value=first;
root_addr[root_level][leaf_addr].head->next=NULL;
root_addr[root_level][leaf_addr].tail=root_addr[root_level][leaf_addr].head;
root_addr[root_level][leaf_addr].degree=1;
root_addr[root_level][leaf_addr].g_credit[0]=1;
}
else
{
//insert its new neighbor
Neighbor *newNeighbor;
newNeighbor=(Neighbor*)malloc(sizeof(Neighbor));
newNeighbor->neighbor_value=first;
root_addr[root_level][leaf_addr].tail->next=newNeighbor;
root_addr[root_level][leaf_addr].tail=newNeighbor;
root_addr[root_level][leaf_addr].degree++;
}
}
Some suggestions:
a) Consider converting (or pre-processing) the file into a binary format; with the aim to minimise the file size and also drastically reduce the cost of parsing. I don't know the ranges for your values, but various techniques (e.g. using one bit to tell if the number is small or large and storing the number as either a 7-bit integer or a 31-bit integer) could halve the file IO (and double the speed of reading the file from disk) and slash parsing costs down to almost nothing. Note: For maximum effect you'd modify whatever software created the file in the first place.
b) Reading the entire file into memory before you parse it is a mistake. It doubles the amount of RAM required (and the cost of allocating/freeing) and has disadvantages for CPU caches. Instead read a small amount of the file (e.g. 16 KiB) and process it, then read the next piece and process it, and so on; so that you're constantly reusing the same small buffer memory.
c) Use parallelism for file IO. It shouldn't be hard to read the next piece of the file while you're processing the previous piece of the file (either by using 2 threads or by using asynchronous IO).
d) Pre-allocate memory for the "neighbour" structures and remove most/all malloc() calls from your loop. The best possible case is to use a statically allocated array as a pool - e.g. Neighbor myPool[MAX_NEIGHBORS]; where malloc() can be replaced with &myPool[nextEntry++];. This reduces/removes the overhead of malloc() while also improving cache locality for the data itself (see the sketch after this list).
e) Use parallelism for storing values. For example, you could have multiple threads where the first thread handles all the cases where root_level % NUM_THREADS == 0, the second thread handles all cases where root_level % NUM_THREADS == 1, etc.
With all of the above (assuming a modern 4-core CPU), I think you can get the total time (for reading and storing) down to less than 15 seconds.
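As a sketch of suggestion (d), a static pool replacing the per-node malloc() calls might look like this; MAX_NEIGHBORS is a guessed upper bound (the file has roughly ten million lines, i.e. up to about twenty million neighbor entries), and the Neighbor type is the one from the question:
#include <stddef.h>

#define MAX_NEIGHBORS 20000000

static Neighbor neighborPool[MAX_NEIGHBORS]; /* allocated once, up front */
static size_t nextEntry = 0;

/* Drop-in replacement for malloc(sizeof(Neighbor)) in the insertion code. */
static Neighbor *alloc_neighbor(void)
{
    if (nextEntry >= MAX_NEIGHBORS)
        return NULL; /* pool exhausted; caller must handle this */
    return &neighborPool[nextEntry++];
}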
My suggestion would be to form a processing pipeline and thread it. Reading the file is an I/O bound task and parsing it is CPU bound. They can be done at the same time in parallel.
There are several possibilities. You'll have to experiment.
Exploit what your OS gives you. If Windows, check out overlapped I/O. This lets your computation proceed with parsing one buffer full of data while the Windows kernel fills another. Then switch buffers and continue. This is related to what @Neal suggested, but has less overhead for buffering: Windows deposits data directly in your buffer through the DMA channel, with no copying. If Linux, check out memory-mapped files. Here the OS uses the virtual memory hardware to do more or less what Windows does with overlapped I/O.
Code your own integer conversion. This is likely to be a bit faster than making a clib call per integer.
Here's example code. You want to absolutely limit the number of comparisons.
// Process one input buffer.
*end_buf = ' '; // add a sentinel at the end of the buffer
for (char *p = buf; p < end_buf; p++) {
// somewhat unsafe (but fast) reliance on unsigned wrapping
unsigned val = *p - '0';
if (val <= 9) {
// Found start of integer.
for (;;) {
unsigned digit_val = *p - '0';
if (digit_val > 9) break;
val = 10 * val + digit_val;
p++;
}
... do something with val
}
}
Don't call malloc once per record. You should allocate blocks of many structs at a time.
Experiment with buffer sizes.
Crank up compiler optimizations. This is the kind of code that benefits greatly from excellent code generation.
Yes, standard library conversion functions are surprisingly slow.
If portability is not a problem, I'd memory-map the file. Then, something like the following C99 code (untested) could be used to parse the entire memory map:
#include <stdlib.h>
#include <errno.h>
struct pair {
unsigned long key;
unsigned long value;
};
typedef struct {
size_t size; /* Maximum number of items */
size_t used; /* Number of items used */
struct pair item[];
} items;
/* Initial number of items to allocate for */
#ifndef ITEM_ALLOC_SIZE
#define ITEM_ALLOC_SIZE 8388608
#endif
/* Adjustment to new size (parameter is old number of items) */
#ifndef ITEM_REALLOC_SIZE
#define ITEM_REALLOC_SIZE(from) (((from) | 1048575) + 1048577)
#endif
items *parse_items(const void *const data, const size_t length)
{
const unsigned char *ptr = (const unsigned char *)data;
const unsigned char *const end = (const unsigned char *)data + length;
items *result;
size_t size = ITEM_ALLOC_SIZE;
size_t used = 0;
unsigned long val1, val2;
result = malloc(sizeof (items) + size * sizeof (struct pair));
if (!result) {
errno = ENOMEM;
return NULL;
}
while (ptr < end) {
/* Skip newlines and whitespace. */
while (ptr < end && (*ptr == '\0' || *ptr == '\t' ||
*ptr == '\n' || *ptr == '\v' ||
*ptr == '\f' || *ptr == '\r' ||
*ptr == ' '))
ptr++;
/* End of data? */
if (ptr >= end)
break;
/* Parse first number. */
if (*ptr >= '0' && *ptr <= '9')
val1 = *(ptr++) - '0';
else {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
while (ptr < end && *ptr >= '0' && *ptr <= '9') {
const unsigned long old = val1;
val1 = 10UL * val1 + (*(ptr++) - '0');
if (val1 < old) {
free(result);
errno = EDOM; /* Overflow! */
return NULL;
}
}
/* Skip whitespace. */
while (ptr < end && (*ptr == '\t' || *ptr == '\v' ||
*ptr == '\f' || *ptr == ' '))
ptr++;
if (ptr >= end) {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
/* Parse second number. */
if (*ptr >= '0' && *ptr <= '9')
val2 = *(ptr++) - '0';
else {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
while (ptr < end && *ptr >= '0' && *ptr <= '9') {
const unsigned long old = val2;
val2 = 10UL * val2 + (*(ptr++) - '0');
if (val2 < old) {
free(result);
errno = EDOM; /* Overflow! */
return NULL;
}
}
if (ptr < end) {
/* Error unless whitespace or newline. */
if (*ptr != '\0' && *ptr != '\t' && *ptr != '\n' &&
*ptr != '\v' && *ptr != '\f' && *ptr != '\r' &&
*ptr != ' ') {
free(result);
errno = ECOMM; /* Bad data! */
return NULL;
}
/* Skip the rest of this line. */
while (ptr < end && *ptr != '\n' && *ptr != '\r')
ptr++;
}
/* Need to grow result? */
if (used >= size) {
items *const old = result;
size = ITEM_REALLOC_SIZE(used);
result = realloc(result, sizeof (items) + size * sizeof (struct pair));
if (!result) {
free(old);
errno = ENOMEM;
return NULL;
}
}
result->item[used].key = val1;
result->item[used].value = val2;
used++;
}
/* Note: we could reallocate result here,
* if memory use is an issue.
*/
result->size = size;
result->used = used;
errno = 0;
return result;
}
I've used a similar approach to load molecular data for visualization. Such data contains floating-point values, but precision is typically only about seven significant digits, no multiprecision math needed. A custom routine to parse such data beats the standard functions by at least an order of magnitude in speed.
At least the Linux kernel is pretty good at observing memory/file access patterns; using madvise() also helps.
If you cannot use a memory map, then the parsing function would be a bit different: it would append to an existing result, and if the final line in the buffer is partial, it would indicate so (and the number of chars not parsed), so that the caller can memmove() the buffer, read more data, and continue parsing. (Use 16-byte aligned addresses for reading new data, to maximize copy speeds. You don't necessarily need to move the unread data to the exact beginning of the buffer, you see; just keep the current position in the buffered data.)
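A rough sketch of that caller-side refill loop, with parse_chunk() standing in (as a hypothetical prototype) for a parser like the one above that reports how many leading bytes it fully consumed:
#include <stdio.h>
#include <string.h>

#define BUFSZ (1 << 20)

/* Hypothetical parser: consumes complete lines from data[0..length) and
 * returns how many leading bytes it fully parsed (a partial trailing line is
 * left unconsumed). Parsed pairs are counted into *pairs. */
size_t parse_chunk(const char *data, size_t length, size_t *pairs);

size_t refill_loop(FILE *f)
{
    static char buf[BUFSZ];
    size_t have = 0;   /* bytes currently buffered */
    size_t pairs = 0;
    for (;;) {
        size_t got = fread(buf + have, 1, BUFSZ - have, f);
        have += got;
        if (have == 0)
            break;
        size_t used = parse_chunk(buf, have, &pairs);
        memmove(buf, buf + used, have - used); /* keep the partial last line */
        have -= used;
        if (got == 0)  /* EOF: anything left is an incomplete trailing line */
            break;
    }
    return pairs;
}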
Questions?
First, what's your disk hardware? A single SATA drive is likely to be topped out at 100 MB/sec. And probably more like 50-70 MB/sec. If you're already moving data off the drive(s) as fast as you can, all the software tuning you do is going to be wasted.
If your hardware CAN support reading faster, then: first, your read pattern - read the whole file into memory once - is the perfect use-case for direct IO. Open your file using open( "/file/name", O_RDONLY | O_DIRECT );. Read into page-aligned buffers (see the man page for valloc()) in page-sized chunks. Using direct IO will cause your data to bypass double buffering in the kernel page cache, which is useless when you're reading that much data that fast and not re-reading the same data pages over and over.
If you're running on a true high-performance file system, you can read asynchronously and likely faster with lio_listio() or aio_read(). Or you can just use multiple threads to read - and use pread() so you don't have waste time seeking - and because when you read using multiple threads a seek on an open file affects all threads trying to read from the file.
And do not try to read fast into a newly-malloc'd chunk of memory - memset() it first. Because truly fast disk systems can pump data into the CPU faster than the virtual memory manager can create virtual pages for a process.
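A minimal sketch of that direct-I/O reading loop; the chunk size, alignment and error handling are illustrative, and some filesystems impose extra O_DIRECT alignment rules, so treat this as a starting point rather than a drop-in:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1 << 20) /* 1 MiB per read, a multiple of the page size */

int main(int argc, char **argv)
{
    if (argc != 2) return 1;
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, CHUNK)) { close(fd); return 1; }
    memset(buf, 0, CHUNK); /* touch the pages before the fast reads start */

    off_t offset = 0;
    ssize_t got;
    long long total = 0;
    while ((got = pread(fd, buf, CHUNK, offset)) > 0) {
        /* ... parse the `got` bytes in buf here ... */
        total += got;
        offset += got;
    }
    printf("read %lld bytes\n", total);
    free(buf);
    close(fd);
    return 0;
}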

Very Slow Data Processing

Consider the following code that loads a dataset of records into a buffer and creates a Record object for each record. A record constitutes one or more columns and this information is uncovered at run-time. However, in this particular example, I have set the number of columns to 3.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
typedef unsigned int uint;
typedef struct
{
uint *data;
} Record;
Record *createNewRecord (short num_cols);
int main(int argc, char *argv[])
{
time_t start_time, end_time;
int num_cols = 3;
char *relation;
FILE *stream;
int offset;
char *filename = "file.txt";
stream = fopen(filename, "r");
fseek(stream, 0, SEEK_END);
long fsize = ftell(stream);
fseek(stream, 0, SEEK_SET);
if(!(relation = (char*) malloc(sizeof(char) * (fsize + 1))))
printf("Could not allocate buffer\n");
fread(relation, sizeof(char), fsize, stream);
relation[fsize] = '\0';
fclose(stream);
char *start_ptr = relation;
char *end_ptr = (relation + fsize);
while (start_ptr < end_ptr)
{
Record *new_record = createNewRecord(num_cols);
for(short i = 0; i < num_cols; i++)
{
sscanf(start_ptr, " %u %n",
&(new_record->data[i]), &offset);
start_ptr += offset;
}
}
return 0;
}
Record *createNewRecord (short num_cols)
{
Record *r;
if(!(r = (Record *) malloc(sizeof(Record))) ||
!(r->data = (uint *) malloc(sizeof(uint) * num_cols)))
{
printf("Failed to create a new record\n");
}
return r;
}
This code is highly inefficient. My dataset contains around 31 million records (~1 GB) and this code processes only ~200 records per minute. The reason I load the dataset into a buffer is that I'll later have multiple threads process the records in this buffer, and hence I want to avoid file accesses. Moreover, I have 48 GB of RAM, so the dataset in memory should not be a problem. Any ideas on how to speed things up?
SOLUTION: the sscanf function was actually extremely slow and inefficient. When I switched to strtoul, the job finishes in less than a minute. Malloc-ing ~3 million structs of type Record took only a few seconds.
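For reference, a sketch of what the strtoul()-based loop might look like in place of the sscanf() loop above; it assumes the same relation buffer, fsize, num_cols and createNewRecord() from the question, and still needs somewhere to store the records:
/* Sketch: drop-in replacement for the sscanf() parsing loop, using strtoul()
 * from <stdlib.h>. Assumes relation is NUL-terminated as in the question. */
char *p = relation;
char *end = relation + fsize;
while (p < end)
{
    /* Skip whitespace between numbers and stop cleanly at the end. */
    while (p < end && (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r'))
        p++;
    if (p >= end)
        break;
    Record *new_record = createNewRecord(num_cols);
    for (short i = 0; i < num_cols; i++)
    {
        char *next;
        new_record->data[i] = (uint)strtoul(p, &next, 10);
        p = next;
    }
    /* ... store new_record somewhere instead of leaking it ... */
}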
I'm confident that some non-numeric data is lurking in the file.
int offset;
...
sscanf(start_ptr, " %u %n", &(new_record->data[i]), &offset);
start_ptr += offset;
Notice that if the file begins with non-numeric input, offset is never set, and if it happened to hold the value 0, start_ptr += offset; would never advance.
If non-numeric data exists later in the file, like "3x", offset will get the value 1 and never be updated again, so the while loop will crawl forward one character at a time.
Best to check results of fread(), ftell() and sscanf() for unexpected return values and act accordingly.
Further: long fsize may be too small a size for such files. Look into using fgetpos() and fsetpos().
Note: to save processing time, consider using strtoul() as it is certainly faster than sscanf(" %u %n"). Again - check for errant results.
BTW: if the code needs to use sscanf(), use sscanf("%u%n"); it is a tad faster and, for your code, provides the same functionality.
I'm not an optimization professional but I think some tips should help.
First of all, I suggest you use filename and num_cols as macros; literals tend to be faster, and I don't see you changing their values in the code.
Second, using a struct to store only one member is generally not recommended, and if you want to pass it to functions you should only pass pointers. Since I see you're using malloc to allocate the struct and again to allocate its only member, I suppose that is the reason why it is so slow: you're using twice the memory you need. This might not be the case with some compilers, however. Practically, using a struct with only one member is pointless. If you want to ensure that the integer you get is specifically a record, you can typedef it.
You should also make end_ptr and fsize const for some optimization.
Now, as for functionality, have a look at memory-mapped I/O.
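A short sketch of the memory-mapped variant, assuming the file name from the question; note that the mapping is not NUL-terminated, so a parser must respect st.st_size rather than relying on a terminating '\0':
/* Sketch: map the input file instead of malloc+fread; error handling is minimal. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("file.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    const char *relation = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (relation == MAP_FAILED) { perror("mmap"); return 1; }
    madvise((void *)relation, (size_t)st.st_size, MADV_SEQUENTIAL); /* hint: sequential scan */

    /* ... parse the st.st_size bytes at `relation` here ... */

    munmap((void *)relation, (size_t)st.st_size);
    close(fd);
    return 0;
}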

Resources