I have a piece of code that analyzes streams of data from very large (10-100GB) binary files. It works well, so it's time to start optimizing, and currently disk IO is the biggest bottleneck.
There are two types of files in use. The first type consists of a stream of 16-bit integers, which must be scaled after I/O to convert them into physically meaningful floating point values. I read the file in chunks, and within each chunk I read one 16-bit code at a time, perform the required scaling, and then store the result in an array. Code is below:
int64_t read_current_chimera(FILE *input, double *current,
                             int64_t position, int64_t length, chimera *daqsetup)
{
    int64_t test;
    uint16_t iv;
    int64_t i;
    int64_t read = 0;

    if (fseeko64(input, (off64_t)position * sizeof(uint16_t), SEEK_SET))
    {
        return 0;
    }

    for (i = 0; i < length; i++)
    {
        test = fread(&iv, sizeof(uint16_t), 1, input);
        if (test == 1)
        {
            read++;
            current[i] = chimera_gain(iv, daqsetup);
        }
        else
        {
            perror("End of file reached");
            break;
        }
    }
    return read;
}
The chimera_gain function just takes a 16-bit integer, scales it and returns the double for storage.
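(For context, the scaling step has roughly the following shape. This is purely illustrative: the real chimera struct and conversion constants are not shown in the question, and the field names below are made up.)

#include <stdint.h>

/* Hypothetical illustration only; the actual chimera definition is the asker's. */
typedef struct
{
    double scale;    /* made-up volts-per-code factor */
    double offset;   /* made-up zero offset */
} chimera;

static double chimera_gain(uint16_t code, const chimera *daqsetup)
{
    return (double)code * daqsetup->scale + daqsetup->offset;
}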
The second file type contains 64-bit doubles in two columns, of which I only need the first, so I fread pairs of doubles and discard the second of each pair. The doubles must also be endian-swapped before use. The code I use to do this is below:
int64_t read_current_double(FILE *input, double *current, int64_t position, int64_t length)
{
    int64_t test;
    double iv[2];
    int64_t i;
    int64_t read = 0;

    if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
    {
        return 0;
    }

    for (i = 0; i < length; i++)
    {
        test = fread(iv, sizeof(double), 2, input);
        if (test == 2)
        {
            read++;
            swapByteOrder((int64_t *)&iv[0]);
            current[i] = iv[0];
        }
        else
        {
            perror("End of file reached: ");
            break;
        }
    }
    return read;
}
Can anyone suggest a method of reading these file types that would be significantly faster than what I am currently doing?
First off, it would be useful to use a profiler to identify the hot spots in your program. Based on your description of the problem, you have a lot of overhead from the sheer number of fread calls. As the files are large, there will be a big benefit to increasing the amount of data you read per I/O.
Convince yourself of this by putting together two small programs that read the stream.
1) Read it as you do in the example above, two doubles at a time.
2) Read it the same way, but ask for 10,000 doubles per fread.
Time both runs a few times, and odds are you will observe that #2 runs much faster.
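For reference, a bulk version of the question's double reader might look roughly like this (a sketch only: read_current_double_bulk and CHUNK are made-up names, while fseeko64 and swapByteOrder come from the question; error handling is minimal):

#define CHUNK 65536   /* column pairs per fread; arbitrary, tune as needed */

int64_t read_current_double_bulk(FILE *input, double *current,
                                 int64_t position, int64_t length)
{
    static double buffer[2 * CHUNK];   /* ~1 MiB scratch buffer; not thread-safe */
    int64_t done = 0;

    if (fseeko64(input, (off64_t)position * 2 * sizeof(double), SEEK_SET))
    {
        return 0;
    }

    while (done < length)
    {
        int64_t want = length - done;
        if (want > CHUNK)
        {
            want = CHUNK;
        }

        /* one big read instead of one fread per value pair */
        int64_t got = (int64_t)fread(buffer, 2 * sizeof(double), (size_t)want, input);
        if (got == 0)
        {
            break;
        }

        for (int64_t i = 0; i < got; i++)
        {
            swapByteOrder((int64_t *)&buffer[2 * i]);  /* the question's helper */
            current[done + i] = buffer[2 * i];         /* keep column 0, drop column 1 */
        }
        done += got;
    }
    return done;
}

The same idea applies to the 16-bit reader: fread a large block of uint16_t codes and run the scaling loop over that block.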
Best of luck.
I am measuring the latency of some operations.
There are many scenarios here.
The latency of each scenario is roughly concentrated in a small interval. For each scenario I need to measure 500,000 times. At the end I want to output each latency value together with the number of times it occurred.
My initial implementation was:
#define range 1000
int rec_array[range];

for (int i = 0; i < 500000; i++) {
    int latency = measure_latency();
    rec_array[latency]++;
}

for (int i = 0; i < range; i++) {
    printf("%d %d\n", i, rec_array[i]);
}
This approach was fine at first, but as the number of scenarios grew it became problematic.
The latency measured in each scenario is concentrated in a small interval, so most of the entries in rec_array are 0.
Since each scenario is different, the latency values are also different. Some latencies are concentrated around 500, so I need an array with a length greater than 500; others are concentrated around 5000, so I need an array with a length greater than 5000.
Because there are so many scenarios, I end up creating too many arrays. For example, with ten scenarios I need to create ten rec_arrays, each with a different length.
Is there any efficient and convenient strategy? Since I am using C, containers like std::vector are not available.
I considered linked lists. However, the range of the latency distribution is not known in advance, nor is the number of distinct latency values, and the count for a value has to be incremented whenever the same latency occurs again, so a list does not seem very convenient either.
I'm sorry, I had just gone out. Thank you for your help. I have read the comments carefully; here are some of my answers.
These data are mainly used to draw plots of latency versus count.
The comments say the data seem small. The main reason I worried about this is that, as the plot shows, only a few entries of each array are actually used and the vast majority are 0, and there are many scenarios, each of which needs its own array. I have also referred to an open source implementation.
According to the comments, it seems that using arrays directly is a good solution, considering fast access. Thanks very much!
A linked list is probably (and almost always) the least efficient way to store things: both slow as hell, and memory inefficient, since your values use less storage than your pointers. Linked lists are very rarely a good solution for anything that actually stores significant data. The only reason they're so prevalent is that C still has no proper containers, and they're easy wheels to reinvent for every single C program you write.
#define range 1000
int rec_array[range];
So you're (probably! This depends on your compiler and on where you write int rec_array[range];) storing rec_array on the stack, and it's large. (Actually, 4000 bytes is not "large" by any modern computer's standards, but still.) You shouldn't be doing that; instead, this should be heap-allocated, once, at initialization.
The solution is to allocate it:
/* SPDX-License-Identifier: LGPL-2.1+ */
/* Copyright Marcus Müller and others */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N_RUNS 500000

/*
 * Call as
 * program maximum_latency
 */

unsigned int measure_latency(void); /* provided elsewhere */
void print_benchmark_result(unsigned int *array, unsigned int length);

unsigned int *run_benchmark(struct task_t task, unsigned int *latencies,
                            unsigned int *max_latency) {
  for (unsigned int run = 0; run < N_RUNS; ++run) {
    unsigned int latency = measure_latency();
    if (latency >= *max_latency) {
      /* clamp out-of-range values into the last bucket */
      latency = *max_latency - 1;
      /*
       * alternatively: use realloc to increase the size of `latencies`,
       * and update max_latency as well; that's basically what C++ std::vector
       * does
       */
    }
    (latencies[latency])++;
  }
  return latencies;
}

int main(int argc, char **argv) {
  // check argument
  if (argc != 2) {
    exit(127);
  }
  int maximum_latency_raw = atoi(argv[1]);
  if (maximum_latency_raw <= 0) {
    exit(126);
  }
  unsigned int maximum_latency = maximum_latency_raw;
  /*
   * note that the length no longer has to be a constant
   * if you're using calloc/malloc.
   */
  unsigned int *latency_counters =
      (unsigned int *)calloc(maximum_latency, sizeof(unsigned int));
  for (; /* benchmark task in benchmark_tasks */;) {
    run_benchmark(task, latency_counters, &maximum_latency);
    print_benchmark_result(latency_counters, maximum_latency);
    // clear our counters after each run!
    memset(latency_counters, 0, maximum_latency * sizeof(unsigned int));
  }
}

void print_benchmark_result(unsigned int *array, unsigned int length) {
  for (unsigned int index = 0; index < length; ++index) {
    printf("%u %u\n", index, array[index]);
  }
  puts("============================\n");
}
Note especially the "alternatively: realloc" comment in the middle: realloc allows you to increase the size of your array:
unsigned int *run_benchmark(struct task_t task, unsigned int *latencies,
                            unsigned int *max_latency) {
  for (unsigned int run = 0; run < N_RUNS; ++run) {
    unsigned int latency = measure_latency();
    while (latency >= *max_latency) {
      // double the size!
      latencies = (unsigned int *)realloc(
          latencies, (*max_latency) * 2 * sizeof(unsigned int));
      // realloc doesn't zero out the extension, so we need to do that
      // ourselves.
      memset(latencies + (*max_latency), 0,
             (*max_latency) * sizeof(unsigned int));
      (*max_latency) *= 2;
    }
    (latencies[latency])++;
  }
  return latencies;
}
This way, your array grows when you need it to!
How about using a hash table, so we only store the latencies that actually occur? The keys of the hash table could even be ranges, with the values of those keys being the actual latencies.
Just sacrifice some precision in your latencies like 0-15, 16-31, 32-47 ... etc. Now your array will be 16x smaller.
Allocate all latency counter arrays for all scenes in one go
unsigned int *latency_div16_counter = (unsigned int *)calloc((MAX_LATENCY >> 4) * NUM_OF_SCENES, sizeof(unsigned int));
Clamp the values to the max latency, div 16 and store
for (int scene = 0; scene < NUM_OF_SCENES; scene++) {
    for (int i = 0; i < 500000; i++) {
        int latency = measure_latency();
        if (latency >= MAX_LATENCY) latency = MAX_LATENCY - 1;
        latency = latency >> 4; // int div 16
        /* each scene's block is (MAX_LATENCY >> 4) counters long */
        latency_div16_counter[(scene * (MAX_LATENCY >> 4)) + latency]++;
    }
}
Adjust the data (mul 16) before displaying it
for (int scene = 0; scene < NUM_OF_SCENES; scene++) {
    for (int i = 0; i < (MAX_LATENCY >> 4); i++) {
        printf("Scene %d Latency %d Total %d\n", scene, i * 16,
               latency_div16_counter[(scene * (MAX_LATENCY >> 4)) + i]);
    }
}
Consider the following code, which loads a dataset of records into a buffer and creates a Record object for each record. A record consists of one or more columns, and this information is only discovered at run time. However, in this particular example I have set the number of columns to 3.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned int uint;

typedef struct
{
    uint *data;
} Record;

Record *createNewRecord (short num_cols);

int main(int argc, char *argv[])
{
    time_t start_time, end_time;
    int num_cols = 3;
    char *relation;
    FILE *stream;
    int offset;
    char *filename = "file.txt";

    stream = fopen(filename, "r");
    fseek(stream, 0, SEEK_END);
    long fsize = ftell(stream);
    fseek(stream, 0, SEEK_SET);

    if(!(relation = (char*) malloc(sizeof(char) * (fsize + 1))))
        printf("Could not allocate buffer");
    fread(relation, sizeof(char), fsize, stream);
    relation[fsize] = '\0';
    fclose(stream);

    char *start_ptr = relation;
    char *end_ptr = (relation + fsize);
    while (start_ptr < end_ptr)
    {
        Record *new_record = createNewRecord(num_cols);
        for(short i = 0; i < num_cols; i++)
        {
            sscanf(start_ptr, " %u %n",
                   &(new_record->data[i]), &offset);
            start_ptr += offset;
        }
    }
}

Record *createNewRecord (short num_cols)
{
    Record *r;
    if(!(r = (Record *) malloc(sizeof(Record))) ||
       !(r->data = (uint *) malloc(sizeof(uint) * num_cols)))
    {
        printf("Failed to create a new record\n");
    }
    return r;
}
This code is highly inefficient. My dataset contains around 31 million records (~1 GB) and this code processes only ~200 records per minute. The reason I load the dataset into a buffer is that I'll later have multiple threads process the records in this buffer, and I want to avoid file accesses. Moreover, I have 48 GB of RAM, so holding the dataset in memory should not be a problem. Any ideas on how to speed things up?
SOLUTION: the sscanf function was actually extremely slow and inefficient. When I switched to strtoul, the job finished in less than a minute. Malloc-ing ~3 million structs of type Record took only a few seconds.
Confident that some lurking non-numeric data exists in the file.
int offset;
...
sscanf(start_ptr, " %u %n", &(new_record->data[i]), &offset);
start_ptr += offset;
Notice that if the file begins with non-numeric input, offset is never set; if it happens to hold the value 0, start_ptr += offset; will never advance.
If non-numeric data such as "3x" exists later in the file, offset will get the value 1 and then never be updated again, causing the while loop to crawl forward one character at a time.
Best to check results of fread(), ftell() and sscanf() for unexpected return values and act accordingly.
Further: long fsize may be too small a type. Look to using fgetpos() and fsetpos().
Note: to save processing time, consider using strtoul() as it is certainly faster than sscanf(" %u %n"). Again - check for errant results.
BTW: If code needs to use sscanf(), use sscanf("%u%n"): a tad faster, and for your code the same functionality.
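To make the strtoul() suggestion concrete, one possible shape of the parsing loop is sketched below (parse_record is a made-up helper; Record, uint and createNewRecord() are from the question; error handling is minimal):

#include <errno.h>
#include <stdlib.h>

/* Sketch: parse one record's columns with strtoul() instead of sscanf().
   Returns the position just past the record, or NULL on bad input. */
static char *parse_record(char *p, Record *rec, short num_cols)
{
    for (short i = 0; i < num_cols; i++)
    {
        char *next;
        errno = 0;
        unsigned long v = strtoul(p, &next, 10);
        if (next == p)              /* no digits consumed: non-numeric data */
            return NULL;
        rec->data[i] = (uint)v;
        p = next;                   /* strtoul() skips leading whitespace itself */
    }
    return p;
}

The main loop then becomes something like while (start_ptr && start_ptr < end_ptr) { Record *r = createNewRecord(num_cols); start_ptr = parse_record(start_ptr, r, num_cols); } (and the records still need to be stored somewhere, which the question's loop currently does not do).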
I'm not an optimization professional, but I think some tips should help.
First of all, I suggest you make filename and num_cols macros; they tend to be faster as literals, and I don't see you changing their values in the code.
Second, using a struct to store only one member is generally not recommended, but if you want to pass it to functions you should pass pointers only. Since I see you're calling malloc once to store the struct and again to store its only member, I suppose that is the reason why it is so slow: you're using twice the memory you need. This might not be the case with some compilers, however. Practically, using a struct with only one member is pointless. If you want to ensure that the integer you get (in your case) is specifically a record, you can typedef it.
You should also make end_ptr and fsize const for some optimization.
Now, as for functionality, have a look at memory-mapped I/O.
I've got the following code to read and process multiple very big files one after another.
for(j = 0; j < CORES; ++j) {
    double time = omp_get_wtime();
    printf("File: %d, time: %f\n", j, time);

    char in[256];
    sprintf(in, "%s.%d", FIN, j);
    FILE* f = fopen(in, "r");
    if (f == NULL)
        fprintf(stderr, "open failed: %s\n", FIN);

    int i;
    char buffer[1024];
    char* tweet;
    int takeTime = 1;

    for (i = 0, tweet = TWEETS + (size_t)j*(size_t)TNUM*(size_t)TSIZE; i < TNUM; i++, tweet += TSIZE) {
        double start;
        double end;
        if(takeTime) {
            start = omp_get_wtime();
            takeTime = 0;
        }

        char* line = fgets(buffer, 1024, f);
        if (line == NULL) {
            fprintf(stderr, "error reading line %d\n", i);
            exit(2);
        }

        int fn = readNumber(&line);
        int ln = readNumber(&line);
        int month = readMonth(&line);
        int day = readNumber(&line);
        int hits = countHits(line, key);
        writeTweet(tweet, fn, ln, hits, month, day, line);

        if(i%1000000 == 0) {
            end = omp_get_wtime();
            printf("Line: %d, Time: %f\n", i, end-start);
            takeTime = 1;
        }
    }
    fclose(f);
}
Every file contains 24000000 tweets and I read 8 files in total, one after another.
Each line (1 tweet) gets processed, and writeTweet() copies a modified line into one really big char array.
As you can see, I measure the time to see how long it takes to read and process 1 million tweets. For the first file it's about 0.5 seconds per million, which is fast enough. But with every additional file it takes longer and longer. File 2 takes about 1 second per million lines (though not every time, just on some of the iterations), up to 8 seconds per million on file number 8. Is this to be expected? Can I speed things up? All files are more or less the same, always with 24 million lines.
Edit:
Additional information: every file needs, in processed form, about 730 MB of RAM. That means with 8 files we end up needing about 6 GB of memory.
As requested, here is the content of writeTweet():
void writeTweet(char* tweet, const int fn, const int ln, const int hits, const int month, const int day, char* line) {
    short* ptr1 = (short*) tweet;
    *ptr1 = (short) fn;
    int* ptr2 = (int*) (tweet + 2);
    *ptr2 = ln;
    *(tweet + 6) = (char) hits;
    *(tweet + 7) = (char) month;
    *(tweet + 8) = (char) day;

    int i;
    int n = TSIZE - 9;
    for (i = strlen(line); i < n; i++)
        line[i] = ' '; // padding

    memcpy(tweet + 9, line, n);
}
Probably writeTweet() is the bottleneck. If you copy all processed tweets into memory, a huge data array that the operating system has to manage builds up over time. If you don't have enough memory, or other processes in the system are actively using it, the OS will (in most cases) page part of the data out to disk, which increases the access time to the array. There are also mechanisms in the OS, more hidden from the user's eyes, that can affect performance.
You shouldn't store all processed lines in memory. The simplest way is to dump the processed tweets to disk (write a file). However, the right solution depends on how you use the processed tweets afterwards. If you don't access the data in the array sequentially, it is worth thinking about a special storage data structure (B-trees?). There are already many libraries for this purpose; better to look for them.
UPD:
Modern OSs (Linux included) use a virtual memory model. To maintain this model, the kernel has a memory manager that creates structures of references to the real pages in memory. Usually these are maps, and for large memory volumes they reference sub-maps; it is a rather large branched structure.
When working with a big piece of memory, pages all over that memory are accessed, often randomly. To speed up address translation the OS uses a special cache. I don't know all the subtleties of this process, but I think that in this case the cache has to be invalidated often, because there is not enough room to keep all the references at once. That is an expensive operation, and it reduces performance; the more memory is used, the worse it gets.
If you need to sort the large tweets array, you don't have to keep everything in memory; there are ways to sort data on disk. If you do want to sort the data in memory, it isn't necessary to swap the array elements themselves. It's better to use an intermediate structure with references to the elements of the tweets array and to sort the references instead of the data, as sketched below.
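Something along these lines could work (a sketch; the comparison key, the leading short written by writeTweet(), is only an example):

#include <stdlib.h>
#include <string.h>

/* compare two tweet records through their pointers, by the leading short */
static int cmp_tweet(const void *a, const void *b)
{
    const char *ta = *(const char *const *)a;
    const char *tb = *(const char *const *)b;
    short fa, fb;
    memcpy(&fa, ta, sizeof fa);   /* avoid unaligned pointer casts */
    memcpy(&fb, tb, sizeof fb);
    return (fa > fb) - (fa < fb);
}

/* build an index of pointers into the big TWEETS buffer and sort that;
   the records themselves never move */
static char **build_sorted_index(char *tweets, size_t ntweets, size_t tsize)
{
    char **idx = malloc(ntweets * sizeof *idx);
    if (idx == NULL)
        return NULL;
    for (size_t i = 0; i < ntweets; i++)
        idx[i] = tweets + i * tsize;
    qsort(idx, ntweets, sizeof *idx, cmp_tweet);
    return idx;
}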
I'm trying to implement external sorting in C.
I have to read N integers (N fixed depending on main memory) from a file initially, so that I can apply quicksort on them and then continue with the merging process.
I can think of these 2 ways:
read N integers one by one from the file and put them in an array, then sort them.
read a bulk of data into a big char array and then read integers from it using sscanf.
The 1st method is clearly slow and the 2nd method uses a lot of extra memory (but we have limited main memory).
Is there any better way?
Don't try to be more clever than your OS; it probably supports some clever memory management functions which will make your life easier and your code faster.
Assuming you are using a POSIX compliant operating system, then you can use mmap(2).
Map your file into memory with mmap
Sort it
Sync it
This way the OS will handle swapping out data when room is tight, and swap it in when you need it.
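A minimal sketch of that approach, assuming the file holds raw binary ints (if it is text you would parse it first; error handling is abbreviated):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    if (argc != 2) return 1;

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* map the whole file; the kernel pages it in on demand */
    size_t n = (size_t)st.st_size / sizeof(int);
    int *data = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    qsort(data, n, sizeof *data, cmp_int);        /* sort it in place */
    msync(data, (size_t)st.st_size, MS_SYNC);     /* sync it back to the file */

    munmap(data, (size_t)st.st_size);
    close(fd);
    return 0;
}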
Since stdio file operations are buffered, you won't really need to worry about the first option, especially if the file isn't huge. Remember you're not operating directly on a file, but a representation of that file in memory.
For example, if you scan in one number at a time, the system will read in a much larger section from the file (on my system it's 4096 bytes, or the entire file if it's shorter).
You can use the function below to read ints from the file one by one and continue sorting and merging on the go.
The function takes a file name and an integer index as arguments and returns the int at that position in the file.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int read_int (const char *file_name, int count)
{
    int err = -1;
    int num = 0;

    int fd = open(file_name, O_RDONLY);
    if(fd < 0)
    {
        printf("error opening file\n");
        return (fd);
    }

    /* read the count-th int directly, without maintaining a file offset */
    err = pread(fd, &num, sizeof(int), count * sizeof(int));
    if(err <= 0)   /* pread returns 0 at end of file */
    {
        printf("End of file reached\n");
        close(fd);
        return (err);
    }

    close(fd);
    return (num);
}
Sorting at the same time as you read is the best way, and saving your data into a linked list instead of an array is more efficient for the sort.
You can use fscanf() to read integer by integer from the file, and sort at the moment you read each integer: when you read an integer from the file, put it into the array at the right place, so the array is sorted when you finish reading.
The following example reads from the file integer by integer and inserts each one in sorted position as it is read. The integers are saved into an array, not a linked list.
#include <stdio.h>

/* insert x into the first len-1 elements of array, keeping it sorted */
void sort_insert(int x, int *array, int len)
{
    int i = 0, j;

    for (i = 0; i < (len - 1); i++)
    {
        if (x > array[i])
            continue;
        for (j = (len - 1); j > i; j--)   /* shift larger elements up */
            array[j] = array[j - 1];
        break;
    }
    array[i] = x;
}

int main(void)
{
    int x, i;
    int len = 0;
    int array[50];

    FILE *fp = fopen("myfile.txt", "r");
    if (fp == NULL)
        return 1;

    while (len < 50 && fscanf(fp, " %d", &x) > 0)
    {
        len++;
        sort_insert(x, array, len);
    }
    fclose(fp);

    for (i = 0; i < len; i++)
    {
        printf("array[%d] = %d\n", i, array[i]);
    }
    return 0;
}
I have a simulation program written in C and I need to create random numbers and write them to a txt file. The program only stops
- when a random number that was already generated is generated again, or
- 1 billion random numbers have been generated (no repetition).
My problem is that I could not figure out how to search the txt file for the long int random number just generated!
Text file format is:
9875
764
19827
2332
...
Any help is appreciated..
FILE * out;

int checkNumber(long int num){
    char line[512];
    long int number;
    int result=0;

    if((out = fopen("out.txt","r"))==NULL){
        result= 1;
    }

    char buf[10];
    itoa(num, buf, 10);

    while(fgets(line, 512, out) != NULL)
    {
        if((strstr(line,buf)) != NULL){
            result = 0;
        }
    }
    if(out) {
        fclose(out);
    }
    return result;
}

int main(){
    int seed;
    long int nRNs=0;
    long int numberGenerated;

    out = fopen ("out.txt","w");
    nRNs=0;
    seed = 12345;
    srand (seed);
    fprintf(out,"%d\n",numberGenerated);

    while( nRNs != 1000000000 )
    {
        numberGenerated = rand();
        nRNs++;
        if(checkNumber(numberGenerated)==0){
            fclose(out); break; system("pause");
        }
        else{
            fprintf(out,"%d\n",numberGenerated);
        }
    }
    fclose(out);
}
If the text file only contains randomly generated numbers separated by whitespace, then you need the strtok() function (google its usage) and can throw the values into a binary tree structure, as mentioned by @jacekmigacz. In any circumstance, you will have to search the whole file at least once. Then ftell() to get the location you've searched up to in the file; when another number is generated you can fseek() there so you only read the latest numbers. Remember to get the data line by line with fgets().
Take care of the memory requirements and use malloc() judiciously.
Try a tree (data structure).
Searching linearly through the text file every time is gonna take forever with so many numbers. You could hold every number generated so far sorted in a data structure so that you can do a binary search for a duplicate. This is going to need a lot of RAM though. For 1 billion integers that's already 4GB on a system with 32-bit integers, and you'll need several more for the data structure overhead. My estimate is around 16GB in the worst case scenario (where you actually get to 1 billion unique integers.)
If you don't have a memory monster machine, you should instead write the data structure to a binary file and do the binary search there. Though that's still gonna be quite slow.
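A rough sketch of that idea with a flat sorted array is below (a balanced tree would avoid the O(n) memmove per insert, but the search logic is the same; names are made up):

#include <stdlib.h>
#include <string.h>

typedef struct {
    long *v;
    size_t len, cap;
} seen_set;

/* returns 1 if num was already present, 0 if it was just added, -1 on OOM */
static int seen_check_or_add(seen_set *s, long num)
{
    size_t lo = 0, hi = s->len;
    while (lo < hi) {                       /* binary search for insertion point */
        size_t mid = lo + (hi - lo) / 2;
        if (s->v[mid] < num) lo = mid + 1;
        else                 hi = mid;
    }
    if (lo < s->len && s->v[lo] == num)
        return 1;                           /* duplicate found */

    if (s->len == s->cap) {                 /* grow the array as needed */
        size_t ncap = s->cap ? s->cap * 2 : 1024;
        long *nv = realloc(s->v, ncap * sizeof *nv);
        if (nv == NULL)
            return -1;
        s->v = nv;
        s->cap = ncap;
    }
    memmove(s->v + lo + 1, s->v + lo, (s->len - lo) * sizeof *s->v);
    s->v[lo] = num;
    s->len++;
    return 0;
}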
This may work, or you can approach it like this (slow, but it will work):
/* fragment: checks whether new_rand already appears in the (space-separated)
   text file and appends it if not; assumes this sits inside a void function */
int new_rand = rand();
static int counter = 0;
FILE *fptr = fopen("txt", "a+");
int i, j = 0;
int c;                       /* int, so EOF can be detected reliably */
char buf[16];

while ((c = getc(fptr)) != EOF)
{
    buf[j++] = (char)c;
    if (c == ' ')
    {
        buf[--j] = '\0';
        i = atoi(buf);
        if (i == new_rand)
            return;          /* already present, bail out */
        j = 0;
    }
}

if (counter < 1000000)
{
    fprintf(fptr, "%d ", new_rand);   /* append as text, matching the parser above */
    counter++;
}
Don't open and scan your file in checkNumber(); you'll be waiting forever.
Instead, keep your generated numbers in memory using a bit set data structure and refer to that.
Your bit set will need to be large enough to indicate every 32-bit integer, so it'll consume 2^32 / 8 bytes (or 512MiB) of memory. This may seem like a lot but it's much smaller than 32-bit * 1,000,000,000 (4GB). Also, both checking and updating will be done in constant time.
Edit: The wikipedia link doesn't do much to explain how to code one, so here's a rough sample: (There're faster ways of writing this, e.g.: using bit shifts instead of division, but this should be easier to understand.)
int checkNumberOrUpdate(char *bitSet, long int num) {
    unsigned char b = 1u << (num % 8);   /* bit within the byte */
    long int w = num / 8;                /* byte index; must not be a char */
    if (bitSet[w] & b) {                 /* test the bit itself */
        return 1;                        /* already seen */
    }
    bitSet[w] |= b;
    return 0;
}
Note, bitSet needs to be calloc()d to the right size from your main function.
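For completeness, the allocation and a typical usage might look like this (a sketch assuming the checkNumberOrUpdate() above; since rand() never exceeds RAND_MAX, the bit set could also be sized as (RAND_MAX + 1L) / 8):

#include <stdio.h>
#include <stdlib.h>

int checkNumberOrUpdate(char *bitSet, long int num);  /* as defined above */

int main(void)
{
    /* one bit per possible 32-bit value: 2^32 / 8 bytes = 512 MiB */
    char *bitSet = calloc(1UL << 29, 1);
    if (bitSet == NULL)
    {
        fputs("calloc failed\n", stderr);
        return 1;
    }

    srand(12345);
    long int nRNs = 0;
    while (nRNs != 1000000000L)
    {
        long int numberGenerated = rand();
        if (checkNumberOrUpdate(bitSet, numberGenerated))
            break;                /* duplicate seen: stop */
        nRNs++;
        /* fprintf(out, "%ld\n", numberGenerated); as in the original program */
    }

    printf("generated %ld unique numbers\n", nRNs);
    free(bitSet);
    return 0;
}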