Performance: read huge FASTA file line by line in C

I have a FASTA file that contains up to 2,000,000 sequence strings (lines). I wrote code that works well with smaller files, but as the file grows it gets slower, even slower per iteration than with the smaller file. I am confused about why it takes more time per iteration when the file size is, say, 100,000 lines, even for the very first iteration, which runs very efficiently in the 10,000-line case.
For example, I put a printf statement in each iteration. With 10,000 lines the first iteration takes 2 ms, whereas with 100,000 lines even the first iteration takes more than 2 ms to print, and so on. Why would it slow down like that?
Can you please help me make it efficient, or at least run at the same speed as it does with the smaller file? I am reading it line by line.
My code is:
#include <zlib.h>
#include <stdlib.h>
#include <string.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

int z = 0;
gzFile fp = gzopen(dbFile, "r"); // read database FASTA file into host memory
kseq_t *seq_d = kseq_init(fp);
int d;
while ((d = kseq_read(seq_d)) >= 0) {
    unsigned char *b = malloc(256);
    memcpy(b, seq_d->seq.s, 256);
    /* ... do work with b ... */
    z++;
    free(b);
}
kseq_destroy(seq_d);
gzclose(fp);

I have found the issue. I hadn't noticed it before, but my code contained two loops that ran over the whole file and weren't needed (that's why I got variable times per iteration, too). I eliminated them and now it works perfectly.

To improve the speed you can also move the malloc line before the while loop, and the free after the end of the while loop.
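A minimal, self-contained sketch of that hoisting, in plain C without the kseq/zlib dependency; the seqs array here is a hypothetical stand-in for the records kseq_read would return:

```c
#include <stdlib.h>
#include <string.h>

/* Copy up to 256 bytes of each sequence into a scratch buffer that is
 * allocated once before the loop, instead of once per iteration.
 * Returns the number of sequences processed. */
size_t process_all(const char **seqs, size_t n)
{
    unsigned char *b = malloc(256);   /* allocate once, before the loop */
    if (b == NULL)
        return 0;

    size_t processed = 0;
    for (size_t z = 0; z < n; z++) {
        size_t len = strlen(seqs[z]);
        if (len > 256)
            len = 256;
        memcpy(b, seqs[z], len);
        /* ... do work with b ... */
        processed++;
    }

    free(b);                          /* free once, after the loop */
    return processed;
}
```

The per-iteration malloc/free pair is rarely the dominant cost, but hoisting it is free to do and removes allocator traffic from the hot loop.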


C fread function returns wrong file binary data [closed]

I have written a binary file using C and checked it using xxd, with the following output:
6161 6161 6161 6161 6161 6161 6161 6161
Now my code:
FILE * read;
int * filebytes;
read = fopen("data/myfile","rb");
fread(filebytes,4,4,read);
for(int x = 0; x < var; x++){
printf("%x\n",filebytes[x]);
}
fclose(read);
And I'm getting this output:
61616161
61616161
61616171
61616161
Now comes the even weirder thing: if I read out not 4 but 5 32-bit values, the last byte of the third row is not 71 but 75; reading out 6 values gives 79, and so forth (every time adding 4). I can't think of a reason why that would happen, every time at the third element I'm reading.
I would like to know how to read a file out in 32-bit pieces without getting these weird changes.
Happy about any kind of help.
"man fread" says this:
#include <stdio.h>
size_t fread(void *BUF, size_t SIZE, size_t COUNT, FILE *FP);
where:
BUF - pointer to output memory
SIZE - number of bytes in 1 "element"
COUNT - number of 'elements' to read
FP - open file pointer to read from
Here you have SIZE=4, COUNT=4, so you are trying to read 16 bytes.
But you are reading into the memory location pointed to by filebytes - and that pointer is currently uninitialized, just a random value, so it could be pointing anywhere in memory.
As a result, your fread() command could:
end up crashing as the memory location 'filebytes' could point outside the address space
end up pointing to memory that other code will change later (eg as in your case)
just by sheer chance not be used by any other part of the code, and by luck work.
filebytes needs to point to valid memory, eg:
on the heap: filebytes= (int *) malloc(16);
or on the stack, by making filebytes an local array to the function like:
int filebytes[16];
Note that sizeof(int) can vary per machine/architecture, so int filebytes[16] will allocate 16 * sizeof(int) bytes - at least 32, since an int must be at least 2 bytes (16 bits).
(Note also that 'var' is not defined in your example - this should be defined. But even with var defined, this might not make sense:
for(int x = 0; x < var; x++){
printf("%x\n",filebytes[x]);
}
because you haven't said what you are trying to do.
E.g. how many bytes in the input file represent 'one integer' ? This could be any number theoretically.
And if say 4 bytes represent 'one integer' then is the most significant byte first, or the least significant byte first ?
And note that your machine might be a big-endian or little-endian machine. So, for example, if you run on a little-endian machine (least significant byte first) but your file has the most significant byte first, then loading directly into the integer array will not give the correct answer. And even if they match now, if you later run the same program on a different machine (with the opposite endianness), it will start breaking.
You need to decide:
how many bytes are in your integers in the file ?
are they most-significant byte first, or least significant byte first ?
Then you could load into a 'unsigned char' array, and then manually build the integer from that. (That will work no matter endian of the machine.)
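A sketch of that last suggestion, assuming the file stores each 32-bit value least-significant byte first (swap the shifts if your format is the other way around):

```c
#include <stdint.h>

/* Assemble a 32-bit value from 4 bytes stored least-significant byte
 * first. The shifts make this correct regardless of host endianness. */
uint32_t load_le32(const unsigned char *b)
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

You would fread into an unsigned char array and call load_le32 on each 4-byte group, instead of freading directly into an int array.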

Automatically appending numbers to variables in C

Our programming assignment asked us to break a text file into a set of smaller files having names (filename)partx.txt. For example if the argument passed to the program is a text file named stack.txt then the output should be stackpart1.txt, stackpart2.txt etc, where each file is of size 250 bytes max.
What is the best way to achieve the part_x naming?
I learned about using macros with ## to achieve that. What are the drawbacks of this method, and is there a better way?
Is it good practice to generate variable names this way?
Don't confuse variable names with their content; macros and variable names have nothing to do with your assignment. ## is used to join tokens in your code at compile time (a typical usage is to build identifiers, or in general to parametrize code in macros), which is a relatively rare and very specialized task.
What you want to do, instead, is generate strings at runtime based on a pattern (=> you'll have the same string variable that you fill with different content on each iteration); the right function for this is snprintf.
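A tiny helper showing exactly that pattern (the base name "stack" and the part%d format are just examples matching the assignment's naming scheme):

```c
#include <stdio.h>
#include <string.h>

/* Build "<base>part<n>.txt" into out. Returns the number of characters
 * snprintf would have written (excluding the terminating NUL). */
int make_part_name(char *out, size_t outsz, const char *base, int part)
{
    return snprintf(out, outsz, "%spart%d.txt", base, part);
}
```

In the splitting loop you would call make_part_name(name, sizeof name, "stack", chunk_count) once per chunk, reusing the same name buffer.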
It's perfectly simple, I'd say: you open a file (fopen returns a FILE *), then read it in a loop, using fread to specify the maximum number of bytes to read on each iteration. Since you're using a loop anyway, you can increment a simple int to keep track of the chunk-file names, use snprintf to create each name, write the bytes read by fread to each file, and continue until you're done.
Some details on fread that might be useful to you
A basic example (needs some work, still):
#include <stdio.h>

int main(void)
{
    size_t chunk_size = 256, bytes_read;
    int chunk_count = 0;
    char buffer[256];
    char chunk_name[50];
    FILE *src_fp, *target_fp;

    src_fp = fopen("stack.txt", "rb");
    if (src_fp == NULL)
        return 1;
    while ((bytes_read = fread(buffer, 1, chunk_size, src_fp)) > 0)
    {
        ++chunk_count; //next chunk number
        snprintf(chunk_name, sizeof chunk_name, "chunk_part%d.txt", chunk_count);
        target_fp = fopen(chunk_name, "wb");
        if (target_fp == NULL)
            return 1;
        fwrite(buffer, 1, bytes_read, target_fp); //write only what was read
        fclose(target_fp); //close chunk file
    }
    fclose(src_fp);
    printf("Written %d files, each of max %zu bytes\n", chunk_count, chunk_size);
    return 0;
}
Note that this code is not exactly safe to use as it stands. You'll need to check the return values of fopen (it can, and at some point will, return NULL). The fread-based loop simply assumes that a short read means we've reached the end of the source file, which isn't always the case: you'll have to handle NULL pointers and ferror yourself. Either way, the functions to look into are:
fread
fopen
fwrite
fclose
ferror
snprintf
That should do it.
Update, just for the fun of it.
You might want to pad the numbers of your chunk file names (chunk_part0001.txt). To do this, you can try to predict how big the source file is, divide that by 256 to work out how many chunks you're actually going to end up with, and use that many padding zeroes. How to get the file size is explained here, but here's some code I wrote some time ago:
long file_size = 0,
     factor = 10;
int padding_cnt = 1;//at least 1, ensures correct padding
fseek(src_fp, 0, SEEK_END);//go to end of file
file_size = ftell(src_fp);
file_size /= 256;//divided by chunk size => number of chunks
rewind(src_fp);//return to beginning of file
while (factor <= file_size)
{
    factor *= 10;
    ++padding_cnt;
}
//padded chunk file names:
snprintf(chunk_name, sizeof chunk_name, "chunk_part%0*d.txt", padding_cnt, chunk_count);
If you want, I could explain every single statement, but the gist of it is this:
fseek + ftell gets the size of the file (in bytes); divided by the chunk size (256) that gives you the total number of chunks you'll create (+1 for the remainder, which is why padding_cnt is initialized to 1)
The while loop keeps multiplying the factor by 10; each time it does, the padding count increases by one digit
the format passed to snprintf changed to %0*d, which means: "print an int, padded with leading zeroes to the given width". If you end up with 123 chunks, the first chunk file will be called chunk_part001.txt, the tenth file will be chunk_part010.txt, all the way up to chunk_part123.txt.
Refer to the linked question; the accepted answer uses sys/stat.h to get the file size, which is more reliable (though it can pose some minor portability issues). Check the stat wiki for alternatives.
Why? Because it's fun, and it makes the output files easier to sort by name. It also lets you predict how big the char array holding the target file name must be, so if you have to allocate that memory using malloc, you know exactly how much you'll need, instead of allocating 100 chars (which should be enough either way) and hoping you don't run out of space.
Lastly: the more you know, the better IMO, so I thought I'd give you some links and refs you might want to check.
You can either:
Use a macro as suggested (compile time). This requires knowing something about the file size (and the numbers for the sub-files) while writing the code.
Use snprintf() in a loop to generate the filenames (runtime). This works dynamically, driven by whatever logic measures the file size.
That said, best way : use snprintf().

Maximum size array program in C?

With the following code, I am trying to make an array of numbers and then sort them. But if I set a high array size (MAX), the program stops at the last 'randomly' generated number and does not continue to the sorting at all. Could anyone please give me a hand with this?
#include <stdio.h>
#define MAX 2000000
int a[MAX];
int rand_seed=10;
/* from K&R
- returns random number between 0 and 62000.*/
int rand();
int bubble_sort();
int main()
{
int i;
/* fill array */
for (i=0; i < MAX; i++)
{
a[i]=rand();
printf(">%d= %d\n", i, a[i]);
}
bubble_sort();
/* print sorted array */
printf("--------------------\n");
for (i=0; i < MAX; i++)
printf("%d\n",a[i]);
return 0;
}
int rand()
{
rand_seed = rand_seed * 1103515245 +12345;
return (unsigned int)(rand_seed / 65536) % 62000;
}
int bubble_sort(void)
{
int t, x, y;
/* bubble sort the array */
for (x=0; x < MAX-1; x++)
for (y=0; y < MAX-x-1; y++)
if (a[y] > a[y+1])
{
t=a[y];
a[y]=a[y+1];
a[y+1]=t;
}
return 0;
}
The problem is that you are storing the array in the global section. C doesn't give any guarantee about the maximum size of the global section it can support; this is a function of the OS, architecture, and compiler.
So instead of creating a global array, create a global C pointer and allocate a large chunk using malloc. The memory then lives on the heap, which is much bigger and can grow at runtime.
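A minimal sketch of that change (the helper name make_array is mine, not from the question; the point is simply malloc plus a NULL check instead of `int a[MAX]` at file scope):

```c
#include <stdlib.h>

/* Allocate the working array on the heap instead of declaring it as a
 * global. Returns NULL if the system cannot provide the memory, which
 * the caller must check before using the array. */
int *make_array(size_t n)
{
    return malloc(n * sizeof(int));
}
```

In main you'd write `int *a = make_array(MAX); if (a == NULL) return 1;`, use a[i] exactly as before, and free(a) at the end.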
Your array will land in the BSS section for static variables. It will not be part of the image; the program loader will allocate the required space and fill it with zeros before your program starts 'real' execution. You can even control this process if using an embedded compiler, and fill your static data with anything you like. This array may occupy 2 GB of your RAM and yet your exe file may be a few kilobytes: I've just managed to use an over-2 GB array this way and my exe was 34 KB. I can believe a compiler may warn you when you approach 2^31 - 1 elements (if your int is 32-bit), but static arrays with 2 million elements are not a problem nowadays (unless it is an embedded system, but I bet it is not).
The problem might be that your bubble sort has 2 nested loops (as all bubble sorts do), so trying to sort this array of 2 million elements makes the program loop about 2 * 10^12 times (an arithmetic sequence):
inner loop:
1: 1999999 times
2: 1999998 times
...
2000000: 1 time
So in the worst case you must compare (and possibly swap) elements
2000000 * (1999999 + 1) / 2 = (4 / 2) * 10^12 = 2 * 10^12 times
(correct me if I am wrong above)
Your program simply stays in the sort routine too long and you are not even aware of it. What you see is just the last random number printed and the program not responding. Even on my really fast PC, a 200K array took around one minute to sort this way.
It is not related to your OS, compiler, heap, etc. Your program is just stuck, since with 2 million elements your loops execute about 2 * 10^12 times.
To verify my words, print "sort started" before sorting and "sort finished" after it. I bet the last thing you'll see is "sort started". In addition, you may print the current x value before the inner loop in bubble_sort - you'll see that it is working.
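If the goal is a sorted array rather than a bubble-sort exercise, the standard library's qsort brings this down to O(n log n); a sketch with a plain integer comparator (cmp_int is my name, not from the question):

```c
#include <stdlib.h>

/* Comparator for qsort: returns negative, zero, or positive. Written as
 * a difference of comparisons to avoid the overflow risk of (a - b). */
int cmp_int(const void *pa, const void *pb)
{
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    return (a > b) - (a < b);
}
```

With the array from the question you'd replace the bubble_sort() call with qsort(a, MAX, sizeof a[0], cmp_int); and the 2-million-element sort finishes in a fraction of a second.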
Dynamic array:
int *Array;
Array = malloc(sizeof(int) * Size);
The original C standard (ANSI 1989/ISO 1990) required that a compiler successfully translate at least one program containing at least one example of a set of environmental limits. One of those limits was being able to create an object of at least 32,767 bytes.
This minimum limit was raised in the 1999 update to the C standard to be at least 65,535 bytes.
No C implementation is required to provide for objects greater than that size, which means that they don't need to allow for an array of ints with more than
65535 / sizeof(int)
elements.
In very practical terms, on modern computers, it is not possible to say in advance how large an array can be created. It can depend on things like the amount of physical memory installed in the computer, the amount of virtual memory provided by the OS, the number of other tasks, drivers, and programs already running, and how much memory they are using. So your program may be able to use more or less memory today than it could yesterday or than it will be able to tomorrow.
Many platforms place their strictest limits on automatic objects, that is those defined inside of a function without the use of the 'static' keyword. On some platforms you can create larger arrays if they are static or by dynamic allocation.

Best way to convert whole file to lowercase in C

I was wondering if there is a really good (performant) solution for converting a whole file to lowercase in C.
I use fgetc, convert the char to lowercase, and write it to another temp file with fputc. At the end I remove the original and rename the temp file to the original's name. But I think there must be a better solution.
This doesn't really answer the question (community wiki), but here's an (over?)-optimized function to convert text to lowercase:
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
int fast_lowercase(FILE *in, FILE *out)
{
char buffer[65536];
size_t readlen, wrotelen;
char *p, *e;
char conversion_table[256];
int i;
for (i = 0; i < 256; i++)
conversion_table[i] = tolower(i);
for (;;) {
readlen = fread(buffer, 1, sizeof(buffer), in);
if (readlen == 0) {
if (ferror(in))
return 1;
assert(feof(in));
return 0;
}
for (p = buffer, e = buffer + readlen; p < e; p++)
*p = conversion_table[(unsigned char) *p];
wrotelen = fwrite(buffer, 1, readlen, out);
if (wrotelen != readlen)
return 1;
}
}
This isn't Unicode-aware, of course.
I benchmarked this on an Intel Core 2 T5500 (1.66GHz), using GCC 4.6.0 and i686 (32-bit) Linux. Some interesting observations:
It's about 75% as fast when buffer is allocated with malloc rather than on the stack.
It's about 65% as fast using a conditional rather than a conversion table.
I'd say you've hit the nail on the head. The temp file means you don't delete the original until you're sure you're done processing it, so on error the original remains. I'd say that's the correct way of doing it.
As suggested by another answer, (if the file size permits) you can memory-map the file via the mmap function and have it readily available in memory. There's no real performance difference if the file is smaller than a page, as it will probably be read into memory on the first read anyway.
You can usually get a little more speed on big inputs by using fread and fwrite to read and write big chunks of the input/output. Also, you should probably read a bigger chunk (the whole file, if possible) into memory and then write it all at once.
Edit: I just remembered one more thing. Sometimes programs can be faster if you select a prime number (at the very least, not a power of 2) as the buffer size. I seem to recall this has to do with specifics of the caching mechanism.
If you're processing big files (big as in, say, multi-megabytes) and this operation is absolutely speed-critical, then it might make sense to go beyond what you've inquired about. One thing to consider in particular is that a character-by-character operation will perform less well than using SIMD instructions.
I.e. if you used SSE2, you could code the parallel case conversion like this (pseudocode):
for (cur_parallel_word = begin_of_block;
cur_parallel_word < end_of_block;
cur_parallel_word += parallel_word_width) {
/*
* in SSE2, parallel compares are either about 'greater' or 'equal'
* so '>=' and '<=' have to be constructed. This would use 'PCMPGTB'.
* The 'ALL' macro is supposed to replicate into all parallel bytes.
*/
mask1 = parallel_compare_greater_than(*cur_parallel_word, ALL('A' - 1));
mask2 = parallel_compare_greater_than(ALL('Z' + 1), *cur_parallel_word);
/*
* vector op - and all bytes in two vectors, 'PAND'
*/
mask = mask1 & mask2;
/*
* vector op - add a vector of bytes. Would use 'PADDB'.
*/
new = parallel_add(cur_parallel_word, ALL('a' - 'A'));
/*
* vector op - zero bytes in the original vector that will be replaced
*/
*cur_parallel_word &= !mask; // that'd become 'PANDN'
/*
* vector op - extract characters from new that replace old, then or in.
*/
*cur_parallel_word |= (new & mask); // PAND / POR
}
I.e. you'd use parallel comparisons to check which bytes are uppercase, and then mask both the original value and the lowercased version (one with the mask, the other with its inverse) before you OR them together to form the result.
If you use mmap'ed file access, this could even be performed in-place, saving on the bounce buffer, and saving on many function and/or system calls.
There is a lot to optimize when your starting point is a character-by-character 'fgetc' / 'fputc' loop; even shell utilities are highly likely to perform better than that.
But I agree that if your need is very special-purpose (i.e. something as clear-cut as ASCII input to be converted to lowercase) then a handcrafted loop as above, using vector instruction sets (like SSE intrinsics/assembly, or ARM NEON, or PPC AltiVec), is likely to make a significant speedup possible over existing general-purpose utilities.
Well, you can definitely speed this up a lot if you know what the character encoding is. Since you're using Linux and C, I'm going to go out on a limb here and assume that you're using ASCII.
In ASCII, we know A-Z and a-z are contiguous and always 32 apart. So what we can do is skip the safety checks and locale handling of the tolower() function and do something like this:
(pseudo code)
foreach (int) char c in the file:
c += 32
Or, since the file will contain characters other than uppercase letters, guard it with a check like
if (c > 64 && c < 91) // the uppercase ASCII range
and only then do the addition and write it out to the file.
Also, batch writes are faster, so I would suggest first writing to an array, then writing the contents of the array to the file all at once.
This should be considerably faster.
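A concrete version of that range-check approach over a whole buffer (ASCII-only by construction, as the answer assumes; the function name ascii_lower is mine):

```c
#include <stddef.h>
#include <string.h>

/* ASCII-only lowercasing: add 32 to every byte in 'A'..'Z', leave
 * everything else untouched. No locale lookup, no function call per char. */
void ascii_lower(char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 32;
}
```

You'd fread a chunk into a buffer, run ascii_lower over it, and fwrite it out, which combines this trick with the batch-I/O advice above.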

Pointer becomes nothing for no apparent reason

Greetings!
I have a simple program in Qt, in C.
There are two pointers to type short, used to read from a file and to store bits from the values read.
sample code:
//(input is a FILE* which is opened and passed to the function)
//(output is also a FILE* which is also opened and passed to the function)
//1. Variables declaration
short* sample_buffer;
int buffer_size=1;
short samples_read;
unsigned long value_x=7;
short* nzb_buffer;
short buffer_position=-1;
int i;
//2.Memory allocation
sample_buffer= malloc(sizeof(short)*buffer_size);
nzb_buffer = malloc(sizeof(short)*value_x);
....
//3. Read from infile, one short at time, process and write it to outfile
do
{
//3.1. Read from input file
samples_read = fread(sample_buffer,sizeof(short),buffer_size, input);
//3.2. Switch position inside nzb_buffer one to the right,
// going back to zero if out of bounds
buffer_position=(buffer_position+1)%value_x;
....
//3.3. Put least significant bit of the just read short into nzb_buffer
nzb_buffer[buffer_position]=sample_buffer[0]%2;
....
//3.4. Write the short we just read from infile to the outfile
for (i=0;i<samples_read;i++)
{
fwrite(sample_buffer,sizeof(short),1, output);
}
} while(samples_read==buffer_size);
I've left irrelevant pieces of code out. If you need to see something else, please tell me.
The problem is, after 10 or 15 iterations of the loop, it crashes with a "Segmentation fault" signal. It crashes on the fwrite() function.
I debugged and used a watch on sample_buffer. For some reason, at one exact step, the operation nzb_buffer[buffer_position]=sample_buffer[0]%2 makes sample_buffer become 0x0 (I believe it becomes a null pointer).
This cannot be an overflow of nzb_buffer, because buffer_position for that operation is 3 (out of the 7 allocated for that particular array with malloc). And since each loop iteration makes one write operation and shifts the position, the write into nzb_buffer[3] has already happened earlier in the loop and did not nullify the pointer that time.
I am totally clueless about what may be happening here.
Does anybody have any ideas what is going on, or how I could debug it?
Thanks in advance!
PS: Added comments "what the code does"
Your exit condition for the loop seems to be misplaced. I would do:
samples_read = fread(sample_buffer,sizeof(short),buffer_size, input);
while(samples_read==buffer_size){
[...]
samples_read = fread(sample_buffer,sizeof(short),buffer_size, input);
}
