read multiple fasta sequence using external library kseq.h - c

I am trying to extract the FASTA sequences for 5 user-provided IDs/names from a big FASTA file (containing 80000 sequences) using the external header file kseq.h, as described at http://lh3lh3.users.sourceforge.net/kseq.shtml. When I run the program in a for loop, I have to open and close the big FASTA file again and again (commented out in the code), which makes the computation slow. If, on the other hand, I open and close it only once outside the loop, the program stops as soon as it encounters an ID that is not present in the big FASTA file, i.e. it reaches the end of the file. Can anyone suggest how to get all the sequences without losing computation time? The code is:
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include "ext_libraries/kseq.h"
KSEQ_INIT(gzFile, gzread)
int main(int argc, char *argv[])
{
char gwidd_ids[100];
kseq_t *seq;
int i=0, nFields=0, row=0, col=0;
int size=1000, flag1=0, l=0, index0=0;
printf("Opening file %s\n", argv[1]);
char **gi_ids=(char **)malloc(sizeof(char *)*size);
for(i=0;i<size;i++)
{
gi_ids[i]=(char *)malloc(sizeof(char)*50);
}
FILE *fp_inp = fopen(argv[1], "r");
while(fscanf(fp_inp, "%s", gwidd_ids) == 1)
{
printf("%s\n", gwidd_ids);
strcpy(gi_ids[index0], gwidd_ids);
index0++;
}
fclose(fp_inp);
FILE *f0 = fopen("xxx.txt", "w");
FILE *f1 = fopen("yyy.txt", "w");
FILE *f2 = fopen("zzz", "w");
FILE *instream = NULL;
instream = fopen("fasta_seq_uniprot.txt", "r");
gzFile fpf = gzdopen(fileno(instream), "r");
for(col=0;col<index0;col++)
{
flag1=0;
// FILE *instream = NULL;
// instream = fopen("fasta_seq_nr_uniprot.txt", "r");
// gzFile fpf = gzdopen(fileno(instream), "r");
kseq_t *seq = kseq_init(fpf);
while((kseq_read(seq)) >= 0 && flag1 == 0)
{
if(strcasecmp(gi_ids[col], seq->name.s) == 0)
{
fprintf(f1, ">%s\n", gi_ids[col]);
fprintf(f2, ">%s\n%s\n", seq->name.s, seq->seq.s);
flag1 = 1;
}
}
if(flag1 == 0)
{
fprintf(f0, "%s\n", gi_ids[col]);
}
kseq_destroy(seq);
// gzclose(fpf);
}
gzclose(fpf);
fclose(f0);
fclose(f1);
fclose(f2);
for(i=0;i<size;i++)
{
free(gi_ids[i]);
}
free(gi_ids);
return 0;
}
A few example entries from the input file (fasta_seq_uniprot.txt):
P21306
MSAWRKAGISYAAYLNVAAQAIRSSLKTELQTASVLNRSQTDAFYTQYKNGTAASEPTPITK
P38077
MLSRIVSNNATRSVMCHQAQVGILYKTNPVRTYATLKEVEMRLKSIKNIEKITKTMKIVASTRLSKAEKAKISAKKMD
-----------
-----------
The user entry file is
P37592\n
Q8IUX1\n
B3GNT2\n
Q81U58\n
P70453\n

Your problem appears a bit different than you suppose. That the program stops after trying to retrieve a sequence that is not present in the data file is a consequence of the fact that it never rewinds the input. Therefore, even for a query list containing only sequences that are present in the data file, if the requested sequence IDs are not in the same relative order as the data file then the program will fail to find some of the sequences (it will pass them by when looking for an earlier-listed sequence, never to return).
Furthermore, I think it likely that the time savings you observe come from making only a single pass through the file, instead of a (partial) pass for each requested sequence, not so much from opening it only once. Opening and closing a file is a bit expensive, but nowhere near as expensive as reading tens or hundreds of kilobytes from it.
To answer your question directly, I think you need to take these steps:
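Move the kseq_init(fpf) call to just before the loop.
Move the kseq_destroy(seq) call to just after the loop.
Put in a call to kseq_rewind(seq) as the last statement in the loop, together with gzrewind(fpf): as far as I can tell, kseq_rewind() only resets the reader's internal buffer state and does not reposition the underlying file.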
That should make your program right again, but it is likely to kill pretty much all your time savings, because you will return to scanning the file from the beginning for each requested sequence.
The library you are using appears to support only sequential access. Therefore, the most efficient way to do the job both right and fast would be to invert the logic: read sequences one at a time in an outer loop, testing each one as you go to see whether it matches any of the requested ones.
Supposing that the list of requested sequences will contain only a few entries, like your example, you probably don't need to do any better testing for matches than just using an inner loop to test each requested sequence id vs. the then-current sequence. If the query lists may be a lot longer, though, then you could consider putting them in a hash table or sorting them into the same order as the data file to make it possible to test more efficiently for matches.
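The inverted logic can be sketched roughly as follows. To keep the example free of external headers, it reads a plain-text FASTA stream with fgets instead of going through kseq.h, and it assumes that header lines start with '>' and that '>' appears nowhere else at the start of a read chunk. The names id_index and filter_fasta are made up for illustration. With kseq.h, the outer loop would instead be while (kseq_read(seq) >= 0) and the comparison would use seq->name.s.

```c
#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Return the index of name in ids[0..n-1], or -1 if it was not requested. */
int id_index(char *const ids[], int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcasecmp(ids[i], name) == 0)
            return i;
    return -1;
}

/* Single pass over a FASTA stream: copy every record whose ID is listed in
 * ids[] to out and mark it in found[]. Returns the number of copied records. */
int filter_fasta(FILE *in, FILE *out, char *const ids[], int n, int found[])
{
    char line[4096];
    int copying = 0, hits = 0;

    while (fgets(line, sizeof line, in)) {
        if (line[0] == '>') {
            /* assumption: the ID is the first whitespace-delimited token after '>' */
            char name[256];
            int idx = -1;
            if (sscanf(line + 1, "%255s", name) == 1)
                idx = id_index(ids, n, name);
            copying = (idx >= 0);
            if (copying) {
                found[idx] = 1;
                hits++;
            }
        }
        if (copying)
            fputs(line, out);   /* header and sequence lines of a wanted record */
    }
    return hits;
}
```

After the pass, every index i with found[i] == 0 is a requested ID that does not occur in the data file, which corresponds to what the original program writes to xxx.txt. The order of the query list no longer matters, and the big file is read exactly once.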

Related

What do I need to do to read a file then pick a line and write it to another file (Using C)?

I've been trying to figure out how to read a .txt file, pick a line of that file at random, and then write the result to a different .txt file.
for example:
.txt
bark
run
car
Take lines 2 and 3, add them together, and write the result to Result.txt on a new line.
How would I go about doing this???
I've tried looking around for resources for fopen(), fgets(), fgetc(), fprintf(), puts(). Haven't found anything so far on reading a line that isn't the first line, my best guess:
-read file
-print line of file in memory I.E. an array
-pick a number from random I.E. rand()
-use random number to pick a array location
-write array cell to new file
-repeat twice
-make newline repeat task 4-6
-when done
-close read file
-close write file
I might be overthinking it, or I just don't know which operation gets a single line from anywhere in a file.
I'm just having a hard time wrapping my head around it.
I'm not going to solve the whole exercise, but I will give you a hint on how to copy a line from one file to another.
You can use fgets and increment a counter each time you find a line break. If the line number is the one you want to copy, you simply write the buffer obtained with fgets to the target file with fputs.
#include <stdio.h>
#include <string.h>

int main(void)
{
    // I omit the fopen check for brevity
    FILE *in = fopen("demo.c", "r");
    FILE *out = fopen("out.txt", "w");
    int ln = 1, at = 4; // copy line 4
    char str[128];

    while (fgets(str, sizeof str, in))
    {
        if (ln == at)
        {
            fputs(str, out);
        }
        if (strchr(str, '\n') && (ln++ == at))
        {
            break;
        }
    }
    fclose(in);
    fclose(out);
    return 0;
}
Output:
int main(void)

Replacing bytes at current offset in c

I'm currently developing a program that mimics a UNIX file system. I've prepared my disk as a file (1 MB) with all data blocks inside it. Now I'm implementing some simple commands like mkdir, ls etc. In order to work with those commands, I need to read at a specific offset (no problem with that) and write the modified blocks to a specific location.
Simply my goal is:
SIIIDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD (Current Disk)
I want to replace three blocks with AAA after the 16th byte, so it will look like:
SIIIDDDDDDDDDDDDAAADDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD (Modified Disk)
I'm not going to provide my whole implementation here; I just want some ideas on how to implement this without buffering the entire 1 MB of data in my program. In short, I know the locations of my data blocks, so I just want to replace that part of the file, not the whole file. Can't I simply do this with the file stream functions?
Another example:
fseek(from_disk,superblock.i_node_bit_map_starting_addr , SEEK_SET); //seek to known offset.
read_bit_map(&from_disk); // I can read at specific location without problem
... manipulate bit map ...
fseek(to_disk,superblock.i_node_bit_map_starting_addr , SEEK_SET); //seek to known offset.
write_bit_map(&to_disk); //Write back the data.
//This will destroy the current data of file. (Tried with w+, a modes.)
Note: Not provided in example but I have two file pointers both writing and reading and I'm aware I need to close one before opening another.
I think you are looking for the r+ mode (or possibly rb+). Here is a complete example; afterwards you can run grep -n hello data.txt to verify the result for yourself. You can build and run it with make prog && ./prog.
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Create a test file of 1000 identical lines. */
    FILE *file = fopen("data.txt", "w+");
    if (file == NULL)
        return 1;
    char dummy_data[] = "This is stackoverflow.com\n";
    size_t dummy_data_length = strlen(dummy_data);
    for (int i = 0; i < 1000; ++i)
        fwrite(dummy_data, dummy_data_length, 1, file);
    fclose(file);

    /* Reopen in update mode: "r+" keeps the existing contents. */
    file = fopen("data.txt", "r+");
    if (file == NULL)
        return 1;
    fseek(file, 500, SEEK_SET);    /* absolute offset 500 */
    fwrite("hello", 5, 1, file);   /* overwrites 5 bytes in place */
    fclose(file);
    return 0;
}
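Applied to the disk layout from the question, the same idea could be wrapped in a small helper. This is only a sketch: the function name overwrite_at and the file name in the usage note are made up for illustration.

```c
#include <stdio.h>

/* Overwrite len bytes starting at byte offset `offset` of an existing file.
 * Returns 0 on success and -1 on failure. Bytes outside the written range
 * are left untouched because "rb+" opens the file without truncating it. */
int overwrite_at(const char *path, long offset, const void *data, size_t len)
{
    FILE *f = fopen(path, "rb+");
    if (f == NULL)
        return -1;

    int ok = fseek(f, offset, SEEK_SET) == 0
          && fwrite(data, 1, len, f) == len;

    if (fclose(f) != 0)   /* buffered data is flushed here, so check it too */
        ok = 0;
    return ok ? 0 : -1;
}
```

A call such as overwrite_at("disk.img", 16, "AAA", 3) would replace the three bytes that follow the first 16 bytes, as in the SIII...AAA... example, assuming the disk file is called disk.img.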

Reading all files in two directories at the same time

I have a problem with a task. I have two paths to directories. I can read all files from the first path in argv[1] but can't open files from the second folder in argv[2]. The number of files in both is equal. Writing the file names into an array at the beginning did not work out because there are a few hundred of them. Below is an example of how I try to read the files. I need help. Thanks!
#include "stdafx.h"
#include "windows.h"
int main(int argc, char* argv[])
{
FILE *fp = 0;
uchar tmpl1[BUFFER_SIZE] = { 0 };
uchar tmpl2[BUFFER_SIZE] = { 0 };
size_t size;
size_t n;
FILE *Fl = 0;
if (argc != 3 || argv[1] == NULL || argv[2] == NULL)
{
printf("Error", argv[0]);
return -1;
}
char Fn[255];
HANDLE hFind;
WIN32_FIND_DATA ff;
char Fn1[255];
HANDLE hFind1;
WIN32_FIND_DATA ff1;
sprintf_s(Fn, 255, "%s\\*", argv[1]);
sprintf_s(Fn1, 255, "%s\\*", argv[2]);
if ((hFind = FindFirstFile(Fn, &ff)) != INVALID_HANDLE_VALUE)
{
if ((hFind1 = FindFirstFile(Fn1, &ff1)) != INVALID_HANDLE_VALUE)
{
do
{
if (ff.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) continue;
ff1.dwFileAttributes;
sprintf_s(Fn, "%s\\%s", argv[1], ff.cFileName);
sprintf_s(Fn1, "%s\\%s", argv[2], ff1.cFileName);
// here I can't read file's name from second folder
printf(Fn, "%s\\%s", argv[1], ff.cFileName);
printf(Fn1, "%s\\%s", argv[2], ff1.cFileName);
if (fopen_s(&fp, Fn, "rb") != 0)
{
printf("Error reading\nUsage: %s <tmpl1>\n", argv[1]);
return -1;
}
size = _filelength(_fileno(fp));
n = fread(tmpl1, size, 1, fp);
fclose(fp);
fp = 0;
} while (FindNextFile(hFind, &ff));
// also I have a problem how read next file in second directory
FindClose(hFind);
}
}
return 0;
}
I didn't read why you want to scan two directories concurrently.
When I saw "at the same time" in the title I thought "concurrently". Afterwards, I saw the presented code and realized it shall be done rather "interleaved" instead of "concurrently" but that's not essential.
I assume you want to associate the file names in the first directory somehow to the file names in the second directory. This might be comparing the file names, read data from a file of first directory and read other data from an associated file of second directory, or may be something completely different.
However, based on this assumption, you have to consider that:
You should not assume to get file names in any useful order when scanning with FindFirstFile()/FindNextFile(). These functions return the files in their "physical order", i.e. how they are listed internally. (At best, you get . and .. always as first entries but I wouldn't even count on this.)
Considering this, I would recommend the following procedure:
read file names from first directory and store them in an array names1
read file names from second directory and store them in an array names2
sort arrays names1 and names2 with an appropriate criterion (e.g. lexicographically)
process the arrays names1 and names2.
As you see, the "read file names from directory and store them in an array" could be implemented as function and re-used as well as the sorting.
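The sorting step of this procedure could look roughly like the sketch below. It assumes the names were already collected into arrays of char pointers; cmp_names and sort_name_lists are illustrative names, not part of any API.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of char*: each element passed to the
 * comparator is a pointer to one of the string pointers. */
int cmp_names(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sort both name lists with the same criterion so that entry i of the
 * first list can be paired with entry i of the second one. */
void sort_name_lists(char **names1, size_t n1, char **names2, size_t n2)
{
    qsort(names1, n1, sizeof *names1, cmp_names);
    qsort(names2, n2, sizeof *names2, cmp_names);
}
```

strcmp compares case-sensitively. Since file names on Windows are usually case-insensitive, a case-insensitive comparison (for example _stricmp in the Microsoft C runtime) may be the better criterion there.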
This said, finally, the answer for how to interleave two directory scans:
HANDLE hFind1 = FindFirstFile(Fn1, &ff1);
HANDLE hFind2 = FindFirstFile(Fn2, &ff2);
while (hFind1 != INVALID_HANDLE_VALUE || hFind2 != INVALID_HANDLE_VALUE) {
    if (hFind1 != INVALID_HANDLE_VALUE) {
        /** @todo process ff1 somehow */
    }
    if (hFind2 != INVALID_HANDLE_VALUE) {
        /** @todo process ff2 somehow */
    }
    /* iterate: advance only the scans that are still active */
    if (hFind1 != INVALID_HANDLE_VALUE && !FindNextFile(hFind1, &ff1)) {
        FindClose(hFind1); hFind1 = INVALID_HANDLE_VALUE;
    }
    if (hFind2 != INVALID_HANDLE_VALUE && !FindNextFile(hFind2, &ff2)) {
        FindClose(hFind2); hFind2 = INVALID_HANDLE_VALUE;
    }
}
Please note that I "abuse" the handles hFind1 and hFind2 themselves for loop repetition. Thus, I do not need the extra ifs. (I like things like that.)
Btw. this loop iterates until both directories are scanned completely (even if they don't contain the same number of entries).
If you want to iterate instead until at least one directory is scanned completely you may achieve this by simply changing the while condition to:
while (hFind1 != INVALID_HANDLE_VALUE && hFind2 != INVALID_HANDLE_VALUE) {
In that case, the handle of the scan that has not finished yet is still open after the loop and should be closed with FindClose().
At last, a little story out of my own past (where I learnt a useful lesson regarding this):
I had just finished my studies (in computer science) and was working at home on a rather freshly installed Windows NT when I started to copy a large directory from a CD drive to the hard disk. The estimated time was roughly 1 hour and I thought: 'Hey. It does multi-tasking!' Thus, I started a second File Manager to copy another directory from this CD drive concurrently. When I hit the OK button, the sudden noises of the CD drive alerted me, as did the estimated time, which "exploded" to multiple hours. After that, I behaved as you would expect: tapped on my forehead and mumbled something like "unshareable resources"... (and, of course, stopped the second copy and went for a coffee instead.)

Append Random Text without Repetition for File (C)

I have a list of 5 names
char *name[] = {"a","b","c","d","e"};
and I have 3 files
char path1[PATH_MAX+1]
snprintf(path1, PATH_MAX+1, "%sfile1.txt",dirname);
FILES *filename1 = fopen(path1, "w")
.
.
.
char path3[PATH_MAX+1]
snprintf(path3, PATH_MAX+1, "%sfile3.txt",dirname);
FILES *filename3 = fopen(path3, "w")
What I want is to randomly append a,b,c,d,e (one of them per file) into three of those files without repetition.
What I have right now is (example from one of them)
srand(time(NULL));
int one = rand()%5;
char path1[PATH_MAX+1];
snprintf(path1, PATH_MAX+1, "%sfile1.txt",dirname);
FILES *filename1 = fopen(path1, "w");
fputs(name[one],filename1);
fclose(filename1);
However, it is still possible that my file1.txt and file3.txt both contain b (the same entry from name).
Questions
Did I miss something needed to make sure that the random results are always unique?
Is it efficient to have 6 lines of code to create one file and append a random name to it? If I had to create, say, 20 files, I would be writing 120 lines that are almost identical and differ only in the number (filename1 to filename3).
Thank you.
To get a unique sequence of characters, you can draw them from a diminishing pool. For example, after you have picked the a, it is removed from the pool. Of course, the pool must be at least as big as the number of files you want to print.
A simple way to implement this sort of picking is to pick a char from the pool, move the last character from the pool to the place of the picked character and decrease the pool size by one.
A lot of repeated code, especially when the only difference is a variable name along the lines of filename1, filename2, filename3 and so on, should ring a bell that you should use an array: FILE *file[NFILE]. Be aware, though, that you can only have a certain number of files open at a time.
In your case, you want to write a single character to a file. There's no need to have multiple files open simultaneously: open a file, pick a char, write it to the file, close the file. Then process the next file.
The program below does what you want, I think.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define NFILES 10

int main(void)
{
    char pool[] = "abcdefghij";     // Available chars as array
    int npool = sizeof(pool) - 1;   // Pool size minus terminating '\0'
    int i;

    if (npool < NFILES) {
        fprintf(stderr,
            "Not enough letters in pool for %d files.\n", NFILES);
        exit(1);
    }
    srand(time(NULL));
    for (i = 0; i < NFILES; i++) {
        int ipick = rand() % npool;     // position of pick
        char pick = pool[ipick];        // picked char
        char fname[20];
        FILE *f;

        pool[ipick] = pool[--npool];    // remove pick from pool
        snprintf(fname, sizeof(fname), "file-%03d.txt", i);
        f = fopen(fname, "w");
        if (f == NULL) {
            fprintf(stderr, "Could not create %s.\n", fname);
            exit(1);
        }
        fprintf(f, "%c\n", pick);
        fclose(f);
    }
    return 0;
}

easy way to parse a text file?

I'm making a load balancer (a very simple one). It looks at how long the user has been idle, and the load on the system to determine if a process can run, and it goes through processes in a round-robin fashion.
All of the data needed to control the processes are stored in a text file.
The file might look like this:
PID=4390 IDLE=0.000000 BUSY=2.000000 USER=2.000000
PID=4397 IDLE=3.000000 BUSY=1.500000 USER=4.000000
PID=4405 IDLE=0.000000 BUSY=2.000000 USER=2.000000
PID=4412 IDLE=0.000000 BUSY=2.000000 USER=2.000000
PID=4420 IDLE=3.000000 BUSY=1.500000 USER=4.000000
This is a university assignment, however parsing the text file isn't supposed to be a big part of it, which means I can use whatever way is the quickest for me to implement.
Entries in this file will be added and removed as processes finish or are added under control.
Any ideas on how to parse this?
Thanks.
Here is some code that will parse your file and also account for the fact that the file might be unavailable (that is, fopen might fail) or might be in the middle of being written while you read it (that is, fscanf might fail). Note the infinite loop, which you might not want to use: this is closer to pseudo-code than to code to be copy-pasted into your project, and I didn't try to run it. Note also that it might be quite slow given the duration of the sleep; it is more of a hack, and you might want to use a more advanced approach.
#include <stdio.h>
#include <unistd.h>   /* sleep (POSIX) */

int main(void)
{
    int pid;
    float idle, busy, user;
    FILE *fid;
    fpos_t pos;
    int pos_init = 0;

    while (1)
    {
        // try to open the file
        if ((fid = fopen("myfile.txt", "r")) == NULL)
        {
            sleep(1); // sleep for a little while, and try again
            continue;
        }
        // resume at the remembered position (if initialized)
        if (pos_init)
            fsetpos(fid, &pos);
        // read as many complete lines as you can; the leading space in the
        // format skips the newline left over from the previous line
        while (fscanf(fid, " PID=%d IDLE=%f BUSY=%f USER=%f", &pid, &idle, &busy, &user) == 4)
        {
            // add here your code processing data
            fgetpos(fid, &pos); // remember current position
            pos_init = 1;       // position has been initialized
        }
        // no complete line matched: the file might currently be written, try again later
        fclose(fid);
        sleep(1);
    }
}
As far as just parsing is concerned, something like this in a loop:
int pid;
float idle, busy, user;
/* the leading space skips the newline left over from the previous line */
if (fscanf(inputStream, " PID=%d IDLE=%f BUSY=%f USER=%f", &pid, &idle, &busy, &user) != 4)
{
    /* handle the error */
}
But as @Blrfl pointed out, the big problem is to avoid mix-ups when your application is reading the file while others are writing to it. To solve this problem you should use a lock or something like that; see e.g. the flock syscall.
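A minimal sketch of the locking idea, assuming a BSD/Linux system where flock() is available and assuming that every writer takes an exclusive lock (LOCK_EX) on the same file. flock locks are advisory, so they only help if all parties cooperate. The function name read_pids_locked is made up for illustration.

```c
#include <stdio.h>
#include <sys/file.h>   /* flock, LOCK_SH, LOCK_UN (BSD/Linux, not ISO C) */

/* Read up to max PID values from the file while holding a shared lock.
 * Returns the number of entries read, or -1 if the file could not be
 * opened or locked. */
int read_pids_locked(const char *path, int pids[], int max)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (flock(fileno(f), LOCK_SH) != 0) {   /* waits while a writer holds LOCK_EX */
        fclose(f);
        return -1;
    }

    int n = 0;
    float idle, busy, user;
    /* the leading space skips the newline left by the previous line */
    while (n < max &&
           fscanf(f, " PID=%d IDLE=%f BUSY=%f USER=%f",
                  &pids[n], &idle, &busy, &user) == 4)
        n++;

    flock(fileno(f), LOCK_UN);
    fclose(f);
    return n;
}
```

The sketch keeps only the PID of each entry to stay short; the other three fields are parsed and discarded.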
Use fscanf in a loop. Here's a GNU C tutorial on using fscanf.
/* fscanf example */
#include <stdio.h>

typedef struct lbCfgData {
    int pid;
    double idle;
    double busy;
    double user;
} lbCfgData_t;

int main(void)
{
    // PID=4390 IDLE=0.000000 BUSY=2.000000 USER=2.000000
    lbCfgData_t cfgData[128];
    FILE *f = fopen("myfile.txt", "r");
    if (f == NULL)
        return 1;
    int i = 0;
    while (i != 128 // Make sure we don't overflow the array
           && fscanf(f, " PID=%d IDLE=%lf BUSY=%lf USER=%lf", &cfgData[i].pid,
                     &cfgData[i].idle, &cfgData[i].busy, &cfgData[i].user) == 4)
        i++;
    fclose(f);
    printf("parsed %d entries\n", i);
    return 0;
}

Resources