Append Random Text without Repetition for File (C)

I have a list of 5 names
char *name[] = {"a","b","c","d","e"};
and I have 3 files
char path1[PATH_MAX+1];
snprintf(path1, PATH_MAX+1, "%sfile1.txt", dirname);
FILE *filename1 = fopen(path1, "w");
.
.
.
char path3[PATH_MAX+1];
snprintf(path3, PATH_MAX+1, "%sfile3.txt", dirname);
FILE *filename3 = fopen(path3, "w");
What I want is to randomly append a,b,c,d,e (one of them per file) into three of those files without repetition.
What I have right now is (example from one of them)
srand(time(NULL));
int one = rand()%5;
char path1[PATH_MAX+1];
snprintf(path1, PATH_MAX+1, "%sfile1.txt",dirname);
FILE *filename1 = fopen(path1, "w");
fputs(name[one],filename1);
fclose(filename1);
However, it is still possible for my file1.txt and file3.txt to both contain b (the same letter from name).
Questions
Did I miss something needed to make sure that the random results are always unique?
Also, is it really efficient to have 6 lines of code to create one file and append a random name to it? I'm wondering because if I have to create, say, 20 files, I will write 120 lines that are almost the same, differing only in the numbers (filename1, filename2, and so on).
Thank you.

To get a unique sequence of characters, you can draw them from a diminishing pool. For example, after you have picked the a, it is removed from the pool. Of course, the pool must be at least as big as the number of files you want to print.
A simple way to implement this sort of picking is to pick a char from the pool, move the last character from the pool to the place of the picked character and decrease the pool size by one.
A lot of repeated code, especially when the only difference is a variable name along the lines of filename1, filename2, filename3 and so on, should ring a bell: use an array, FILE *file[NFILE]. Be aware, though, that you can only have a certain number of files open at a time.
In your case, you want to write a single character to each file. There's no need to have multiple files open simultaneously: open a file, pick a char, write it to the file, close the file. Then process the next file.
The program below does what you want, I think.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define NFILES 10

int main(void)
{
    char pool[] = "abcdefghij";   // Available chars as array
    int npool = sizeof(pool) - 1; // Pool size minus terminating '\0'
    int i;

    if (npool < NFILES) {
        fprintf(stderr,
            "Not enough letters in pool for %d files.\n", NFILES);
        exit(1);
    }

    srand(time(NULL));

    for (i = 0; i < NFILES; i++) {
        int ipick = rand() % npool;  // position of pick
        char pick = pool[ipick];     // picked char
        char fname[20];
        FILE *f;

        pool[ipick] = pool[--npool]; // remove pick from pool

        snprintf(fname, sizeof(fname), "file-%03d.txt", i);
        f = fopen(fname, "w");
        if (f == NULL) {
            fprintf(stderr, "Could not create %s.\n", fname);
            exit(1);
        }

        fprintf(f, "%c\n", pick);
        fclose(f);
    }

    return 0;
}

Check multiple files with "strstr" and "fopen" in C

Today I decided to learn to code for the first time in my life. I decided to learn C. I have created a small program that checks a txt file for a specific value. If it finds that value then it will tell you that that specific value has been found.
What I would like to do is put multiple files through this program. I want it to scan all files in a folder for a specific string and display which files contain that string (basically a file index).
I just started today and I'm 15 years old, so I don't know if my assumptions about how this can be done are correct, and I'm sorry if it sounds stupid, but I have been thinking of maybe creating a thread for every directory I put into this program; each thread would individually run this code on a single file, and then the program would display all the directories in which the string can be found.
I have been looking into threading but I don't quite understand it. Here's the working code for one file at a time. Does anyone know how to make this work as I want it?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    //searches for this string in a txt file
    char searchforthis[200];
    //file name to display at output
    char ch, file_name[200];
    FILE *fp;

    //Asks for full directory of txt file (example: C:\users\...) and reads that file.
    //fp is content of file
    printf("Enter name of a file you wish to check:\n");
    gets(file_name);
    fp = fopen(file_name, "r"); // read mode

    //If there's no data inside the file it displays following error message
    if (fp == NULL)
    {
        perror("Error while opening the file.\n");
        exit(EXIT_FAILURE);
    }

    //asks for string (what has to be searched)
    printf("Enter what you want to search: \n");
    scanf("%s", searchforthis);

    char* p;
    // Find first occurrence of searchforthis in fp
    p = strstr(searchforthis, fp);

    // Prints the result
    if (p) {
        printf("This Value was found in following file:\n%s", file_name);
    } else
        printf("This Value has not been found.\n");

    fclose(fp);
    return 0;
}
This line,
p = strstr(searchforthis, fp);
is wrong. strstr() is declared as char *strstr(const char *haystack, const char *needle); there are no file pointers in it.
Forget about gets(); it's prone to buffer overflow. See: Why is the gets function so dangerous that it should not be used?
Your scanf("%s", ...) is just as dangerous as gets(), since you don't limit the number of characters to be read. Instead, you could re-format it as:
scanf("%199s", searchforthis); /* 199 characters + \0 to mark the end of the string */
Also check the return value of scanf() in case an input error occurs; the final code should look like this:
if (scanf("%199s", searchforthis) != 1)
{
    exit(EXIT_FAILURE);
}
It is even better to use fgets() for this, though keep in mind that fgets() also stores the newline character in the buffer, so you will have to strip it manually.
To actually perform checks on the file, you have to read it line by line, using a function like fgets() or fscanf(), or POSIX getline(), and then use strstr() on each line to determine whether you have a match. Something like this should work:
char *p;
char buff[500];
int flag = 0, lines = 1;

while (fgets(buff, sizeof(buff), fp) != NULL)
{
    size_t len = strlen(buff);            /* get the length of the string */
    if (len > 0 && buff[len - 1] == '\n') /* check if the last character is the newline character */
    {
        buff[len - 1] = '\0';             /* place \0 in the place of \n */
    }

    p = strstr(buff, searchforthis);
    if (p != NULL)
    {
        /* match - set flag to 1 */
        flag = 1;
        break;
    }
}

if (flag == 0)
{
    printf("This Value has not been found.\n");
}
else
{
    printf("This Value was found in following file:\n%s", file_name);
}
flag is used to determine whether or not searchforthis exists in the file.
Side note: if a line contains more than 499 characters, you will need a larger buffer or a different function; consider getline() for that case, or even a custom loop reading character by character.
If you want to do this for multiple files, you have to place the whole process in a loop. For example,
for (int i = 0; i < 5; i++) /* this will execute 5 times */
{
    printf("Enter name of a file you wish to check:\n");
    ...
}
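Since the goal is to scan every file in a folder, a directory loop may fit better than prompting for each name. Below is a minimal POSIX sketch (not from the original answer) that applies the fgets()/strstr() search above to every regular file in the current directory; dirent.h is POSIX, and the d_type field is a common extension that is not available on every system, so treat both as assumptions about the target platform.

#include <stdio.h>
#include <string.h>
#include <dirent.h>

/* Returns 1 if 'needle' occurs in the file at 'path', 0 otherwise. */
static int file_contains(const char *path, const char *needle)
{
    char buff[500];
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return 0; /* unreadable file: treat as no match */
    while (fgets(buff, sizeof(buff), fp) != NULL)
    {
        if (strstr(buff, needle) != NULL)
        {
            fclose(fp);
            return 1;
        }
    }
    fclose(fp);
    return 0;
}

int main(void)
{
    char searchforthis[200];
    printf("Enter what you want to search: \n");
    if (scanf("%199s", searchforthis) != 1)
        return 1;

    DIR *dir = opendir("."); /* scan the current directory */
    if (dir == NULL)
    {
        perror("opendir");
        return 1;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL)
    {
        if (entry->d_type != DT_REG) /* skip anything but regular files */
            continue;
        if (file_contains(entry->d_name, searchforthis))
            printf("This Value was found in following file:\n%s\n", entry->d_name);
    }
    closedir(dir);
    return 0;
}

No threads are needed for a first version; the directory loop simply visits each file one at a time.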

What is the fastest way to read several lines of data from a large file

My application needs to read thousands of lines from a large csv file (around 300 GB, with a billion lines); each line contains several numbers. The data look like this:
1, 34, 56, 67, 678, 23462, ...
2, 3, 6, 8, 34, 5
23,547, 648, 34657 ...
...
...
I tried fgets to read the file line by line in C, but it took really, really long; even wc -l in Linux took quite a while just to count all the lines.
I also tried to write all the data to an sqlite3 database based on the logic of the application. However, that data structure is different from the csv file above: it has 100 billion lines, with only two numbers per line. I then created two indices on top of them, which resulted in a 2.5 TB database, while it was 1 TB without indices. Since the indices are larger than the data, a query has to read the whole 1.5 TB of indices, so I think it doesn't make sense to use the database method, right?
So I would like to ask: what is the quickest way to read several lines from a large csv file with a billion lines, in C or Python? And by the way, is there a formula or rule of thumb for estimating read time from file size and RAM capacity?
environment: linux, RAM 200GB, C, python
Requirements
huge csv file, several hundred GB in size
each line contains several numbers
the program must extract several thousand lines per run
the program runs several times against the same file; only different lines should be extracted
Since lines in the csv file have variable lengths, you would have to read the entire file to get the data of the required lines. Sequential reading of the entire file would still be very slow, even if you optimized the file reading as much as possible. A good indicator is the runtime of wc -l, as the OP already mentioned in the question.
Instead, one should optimize on the algorithmic level. A one-time preprocessing of the data is necessary, which then allows fast access to certain lines - without reading the whole file.
There are several possible ways, for example:
Using a database with an index
programmatic creation of an index file (association of line numbers with file offsets)
convert the csv file into a binary file with fixed format
The OP's test shows that approach 1 led to 1.5 TB of indices. Approach 2, a small program that maps a line number to a file offset, is certainly also a possibility (a sketch follows below). Finally, approach 3 would allow the file offset for a line number to be calculated without the need for a separate index file. This approach is especially useful if the maximum number of numbers per line is known. Otherwise, approaches 2 and 3 are very similar.
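For approach 2, a minimal sketch of the index-building step could look like the following. This is not part of the original answer: the program name, the use of POSIX getline()/ftello(), and the raw off_t record format are assumptions for illustration. It writes one byte offset per csv line; to fetch line n later, read the off_t at position n * sizeof(off_t) in the index, fseeko() to that offset in the csv file, and read one line.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Build an index file containing the byte offset of every line
 * of the csv file, one off_t per line. */
int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: mkindex <csv-in> <index-out>\n");
        exit(EXIT_FAILURE);
    }
    FILE *in = fopen(argv[1], "rb");
    FILE *out = fopen(argv[2], "wb");
    if (in == NULL || out == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }

    char *line = NULL;
    size_t cap = 0;
    off_t offset = ftello(in);           /* offset of the next line */
    while (getline(&line, &cap, in) > 0) {
        fwrite(&offset, sizeof(offset), 1, out);
        offset = ftello(in);
    }

    free(line);
    fclose(in);
    fclose(out);
    return EXIT_SUCCESS;
}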
Approach 3 is explained in more detail below. There may be additional requirements that require the approach to be slightly modified, but the following should get things started.
A one-time pre-processing step is necessary. The textual csv lines are converted into int arrays and stored in binary form, with a fixed record format, in a separate file. To then read a particular line n, you can simply calculate the file offset, e.g. as line_nr * (sizeof(int) * MAX_ELEMENTS_PER_LINE). Finally, jump to this offset with fseeko(fp, offset, SEEK_SET); and read MAX_ELEMENTS_PER_LINE ints. So you only need to read the data that you actually want to process.
This has not only the advantage that the program runs much faster, it also requires very little main memory.
Test case
A test file with 3,000,000,000 lines was created. Each line contains up to 10 random int numbers, separated by a comma.
In this case this gave a csv file with about 342 GB of data.
A quick test with
time wc -l numbers.csv
gives
187.14s user 74.55s system 96% cpu 4:31.48 total
This means that it would take a total of at least 4.5 minutes if a sequential file read approach were used.
For one-time preprocessing, a converter program reads each line and stores 10 binary ints per line. The converted file is called 'numbers_bin'. A quick test with access to the data of 10,000 randomly selected rows:
time demo numbers_bin
gives
0.03s user 0.20s system 5% cpu 4.105 total
So instead of 4.5 minutes, it takes 4.1 seconds for this specific example data. That is more than a factor of 65 faster.
Source Code
This approach may sound more complicated than it actually is.
Let's start with the converter program. It reads the csv file and creates a binary fixed format file.
The interesting part takes place in the function pre_process: there a line is read in a loop with 'getline', the numbers are extracted with 'strtok' and 'strtol' and put into an int array initialized with 0. Finally this array is written to the output file with 'fwrite'.
Errors during the conversion result in a message on stderr and the program is terminated.
convert.c
#include "data.h"
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <limits.h>
static void pre_process(FILE *in, FILE *out) {
int *block = get_buffer();
char *line = NULL;
size_t line_capp = 0;
while (getline(&line, &line_capp, in) > 0) {
line[strcspn(line, "\n")] = '\0';
memset(block, 0, sizeof(int) * MAX_ELEMENTS_PER_LINE);
char *token;
char *ptr = line;
int i = 0;
while ((token = strtok(ptr, ", ")) != NULL) {
if (i >= MAX_ELEMENTS_PER_LINE) {
fprintf(stderr, "too many elements in line");
exit(EXIT_FAILURE);
}
char *end_ptr;
errno = 0;
long val = strtol(token, &end_ptr, 10);
if (val > INT_MAX || val < INT_MIN || errno || *end_ptr != '\0' || end_ptr == token) {
fprintf(stderr, "value error with '%s'\n", token);
exit(EXIT_FAILURE);
}
ptr = NULL;
block[i] = (int) val;
i++;
}
fwrite(block, sizeof(int), MAX_ELEMENTS_PER_LINE, out);
}
free(block);
free(line);
}
static void one_off_pre_processing(const char *csv_in, const char *bin_out) {
FILE *in = get_file(csv_in, "rb");
FILE *out = get_file(bin_out, "wb");
pre_process(in, out);
fclose(in);
fclose(out);
}
int main(int argc, char *argv[]) {
if (argc != 3) {
fprintf(stderr, "usage: convert <in> <out>\n");
exit(EXIT_FAILURE);
}
one_off_pre_processing(argv[1], argv[2]);
return EXIT_SUCCESS;
}
data.h
A few auxiliary functions are used. They are more or less self-explanatory.
#ifndef DATA_H
#define DATA_H

#include <stdio.h>
#include <stdint.h>

#define NUM_LINES 3000000000LL
#define MAX_ELEMENTS_PER_LINE 10

void read_data(FILE *fp, uint64_t line_nr, int *block);
FILE *get_file(const char *const file_name, char *mode);
int *get_buffer();

#endif //DATA_H
data.c
#include "data.h"
#include <stdlib.h>
void read_data(FILE *fp, uint64_t line_nr, int *block) {
off_t offset = line_nr * (sizeof(int) * MAX_ELEMENTS_PER_LINE);
fseeko(fp, offset, SEEK_SET);
if(fread(block, sizeof(int), MAX_ELEMENTS_PER_LINE, fp) != MAX_ELEMENTS_PER_LINE) {
fprintf(stderr, "data read error for line %lld", line_nr);
exit(EXIT_FAILURE);
}
}
FILE *get_file(const char *const file_name, char *mode) {
FILE *fp;
if ((fp = fopen(file_name, mode)) == NULL) {
perror(file_name);
exit(EXIT_FAILURE);
}
return fp;
}
int *get_buffer() {
int *block = malloc(sizeof(int) * MAX_ELEMENTS_PER_LINE);
if(block == NULL) {
perror("malloc failed");
exit(EXIT_FAILURE);
}
return block;
}
demo.c
And finally a demo program that reads the data for 10,000 randomly determined lines.
The function request_lines determines 10,000 random line numbers, which are sorted with qsort. The data for these lines is then read. Some lines of the code are commented out; if you uncomment them, the read data is printed to the console.
#include "data.h"
#include <stdlib.h>
#include <assert.h>
#include <sys/stat.h>
static int comp(const void *lhs, const void *rhs) {
uint64_t l = *((uint64_t *) lhs);
uint64_t r = *((uint64_t *) rhs);
if (l > r) return 1;
if (l < r) return -1;
return 0;
}
static uint64_t *request_lines(uint64_t num_lines, int num_request_lines) {
assert(num_lines < UINT32_MAX);
uint64_t *request_lines = malloc(sizeof(*request_lines) * num_request_lines);
for (int i = 0; i < num_request_lines; i++) {
request_lines[i] = arc4random_uniform(num_lines);
}
qsort(request_lines, num_request_lines, sizeof(*request_lines), comp);
return request_lines;
}
#define REQUEST_LINES 10000
int main(int argc, char *argv[]) {
if (argc != 2) {
fprintf(stderr, "usage: demo <file>\n");
exit(EXIT_FAILURE);
}
struct stat stat_buf;
if (stat(argv[1], &stat_buf) == -1) {
perror(argv[1]);
exit(EXIT_FAILURE);
}
uint64_t num_lines = stat_buf.st_size / (MAX_ELEMENTS_PER_LINE * sizeof(int));
FILE *bin = get_file(argv[1], "rb");
int *block = get_buffer();
uint64_t *requests = request_lines(num_lines, REQUEST_LINES);
for (int i = 0; i < REQUEST_LINES; i++) {
read_data(bin, requests[i], block);
//do sth with the data,
//uncomment the following lines to output the data to the console
// printf("%llu: ", requests[i]);
// for (int x = 0; x < MAX_ELEMENTS_PER_LINE; x++) {
// printf("'%d' ", block[x]);
// }
// printf("\n");
}
free(requests);
free(block);
fclose(bin);
return EXIT_SUCCESS;
}
Summary
This approach provides much faster results than reading through the entire file sequentially (4 seconds instead of 4.5 minutes per run for the sample data). It also requires very little main memory.
The prerequisite is the one-time pre-processing of the data into a binary format. This conversion is quite time-consuming, but the data for certain rows can be read very quickly afterwards using a query program.

How to compare two text files in C and output the differences in a new text file

I am writing a program that inputs two text files (inputtxt1, inputtxt2) and outputs an outputtxt file.
These two files contain information such as:
input txt1
S00111111 5 6-Jul-19 09-Aug-19
S00800000 4 1-Jul-19 30-Aug-19
S00000000 1 1-Jul-19 30-Aug-19
input txt2
S00111111 3 6-Jul-19 09-Aug-19
S00222222 1 20-Jul-19 30-Aug-19
S00000000 1 1-Jul-19 30-Aug-19
I am writing the program to input these two txt files and output the differences as SQL queries; the values inside the brackets will change depending on the differences between the text files.
DELETE FROM TABLE WHERE TABLE=[] AND TABLE=[]
INSERT INTO TABLE (TABLE1,TABLE2,TABLE3,TABLE4) VALUES ([ ],[],'[2019-08-30] 00:00:00','[2019-07-01] 00:00:00');
DELETE FROM TABLE WHERE TABLE=[] AND TABLE=[4]
INSERT INTO TABLE (TABLE,TABLE) VALUES ([],[4]);
I wrote my draft in C: basically a while loop that reads each line of the first file and each line of the second file and outputs the query.
Here are my two questions:
First, it unfortunately outputs the SQL 3 times; I think there is something wrong with my while loop.
Secondly, how would I make the program detect that a specific character from a specific line needs to be printed in the query? For example, the number 5 in the first line would be detected and added as a value in one of the queries.
/* This program will input two text files, output a text file with the differences */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

FILE *in1, *in2, *out;
int a, b;

void main (void)
{
    int c;
    char* singleline [33];
    char* singleline2 [33];

    in1 = fopen ("inputtest1.txt", "r"); /* reads from the first file */
    in2 = fopen ("inputtest2.txt", "r"); /* reads from the second file */
    out = fopen ("outputtest3", "w");    /* writes output to this file */

    // Menu //
    printf ("TSC Support Program\n\n");
    printf ("This program compare the two files and output the differences in SQL quries \n");

    // if the file is empty or something went wrong!
    if (in1 == NULL || in2 == NULL)
    {
        printf("********************Can Not Read File !**************************");
    }
    else
    {
        // Checking Every line in the first text file if it equals to the first line of the text file
        while (!feof(in1) && !feof(in2))
        {
            // a = getc(in1);
            // b = getc(in2);
            a = fgets(singleline, 33, in1);
            b = fgets(singleline2, 33, in2);
            if (a != b)
            {
                printf("\n SQL will be printed\n");
                fprintf (out,
                    "\n DELETE FROM BADGELINK WHERE BADGEKEY=[27] AND ACCLVLID=75"
                    "\nINSERT INTO BADGELINK (BADGEKEY,ACCLVLID,ACTIVATE,DEACTIVATE) VALUES ([27],75,'[2010-08-24] 00:00:00','[2010-12-17] 00:00:00'); \n"
                    "\n DELETE FROM BADGE WHERE BADGEKEY=[27] AND ISSUECODE=[75]"
                    "\nINSERT INTO BADGE (BADGEKEY,ISSUECODE) VALUES ([27],[1]);\n"
                );
            }
            else
            {
                printf("Something went wrong");
            }
        }
    }
    fclose(in1);
    fclose(in2);
    fclose(out);
}
It prints the output 5 times, and then it says something went wrong. I am unsure what went wrong.
if (a != b) does not do what you think it is doing: it compares the values returned by fgets() (pointers), not the contents of the lines. Check the strncmp() or memcmp() library functions.
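For example, a line-by-line content comparison could use strcmp(); here is a minimal sketch, assuming singleline and singleline2 are changed to plain char arrays (an illustration, not the original code):

char singleline[33];
char singleline2[33];

/* Compare the line contents, not the fgets() return values. */
while (fgets(singleline, sizeof(singleline), in1) != NULL &&
       fgets(singleline2, sizeof(singleline2), in2) != NULL)
{
    if (strcmp(singleline, singleline2) != 0)
    {
        printf("\n SQL will be printed\n");
        /* fprintf(out, ...) as in the original program */
    }
}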
But if you want to find out the first different character in two strings, the code below would do it for you.
Not tested properly, so take it as a quick prototype.
#include <stdio.h>

/* Return the index of the first differing character,
 * or -1 if the strings are identical. */
int strdiff(const char *s1, const char *s2){
    const char *p1 = s1;
    while (*s1 && *s1 == *s2) {
        s1++;
        s2++;
    }
    if (*s1 != *s2)
        return s1 - p1;
    return -1;
}

int main(){
    char *s1 = "S00111111 5 6-Jul-19 09-Aug-19";
    char *s2 = "S00111111 3 6-Jul-19 09-Aug-19";
    int i = strdiff(s1, s2);
    if (i >= 0)
        printf("%d %c\n", i, s1[i]);
    return 0;
}
Mind you, comparing two files line by line may turn out to be a bigger mission than it sounds if the two files you are comparing do not contain exactly the same lines (with only minor differences, of course).

How to read values from a text file and store them in different variable types in C

I need to read a text file with 7 lines into 7 different variables. The text file looks like this:
.2661
A.txt
B.txt
C.txt
1
2
0.5 0.6
These are the variables that I need to store each line into:
float value1; // line 1 from .txt file
char *AFileName; // line 2 from .txt file
char *BFileName; // line 3 from .txt file
char *CFileName; // line 4 from .txt file
int value2; // line 5 from .txt file
int lastLineLength; // line 6 from .txt file
double lastLine[lastLineLength]; // line 7 from .txt file - this can be different lengths
I have currently been doing this by just passing the values as arguments when I call my program from the command line, and reading them from argv.
First open the file using fopen with read access:
FILE *inputFile = fopen(filename, "r");
if(!inputFile) {
    // Error opening file, handle it appropriately.
}
Then read the data from the file using fscanf. The first parameter is the FILE * we created above. The second parameter is a format string that specifies what fscanf should expect while reading the file. The remaining parameters are pointers to variables that will hold the data read from the file.
int variablesFound;
variablesFound = fscanf(inputFile, "%f\n%s\n%s\n%s\n%d\n%d\n", &value1, AFileName, BFileName, CFileName, &value2, &lastLineLength);
if(variablesFound < 6) {
// There was an error matching the file contents with the expected pattern, handle appropriately.
}
double lastLine[lastLineLength];

// Iterate over the last line.
int lastLineIndex;
for(lastLineIndex = 0; lastLineIndex < lastLineLength; lastLineIndex++) {
    fscanf(inputFile, "%lf", &lastLine[lastLineIndex]);
    fscanf(inputFile, " "); // Eat the space after the double.
}
Edit
After comments I realized it might be worth noting that, as a real first step, you have to allocate memory for your variables. The primitives (those read with an & above) can be declared as normal. For the strings (char arrays), you'll want to do one of the following:
char *aFileName = calloc(MAX_FILENAME_SIZE + 1, sizeof(char));
or
char aFileName[MAX_FILENAME_SIZE + 1];
Your purpose for aFileName determines which method is appropriate. However, assuming this code appears in main or the variable doesn't need to exist beyond the scope of the function, the latter is better, as it doesn't require free()ing the variable after you're done with it.
It may also be worthwhile to single out the code that deals with reading input, if your requirements change often.
You can read from the file as follows:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp;
    char *line = NULL;
    size_t len = 0;  /* getline() allocates the buffer itself */

    fp = fopen("<path to your file>", "r");
    if (fp == NULL)
        exit(-1);

    while (getline(&line, &len, fp) != -1)
        printf("%s", line);

    free(line);
    fclose(fp);
    return 0;
}
getline reads character strings from the file, so you'd have to parse the lines as needed (atoi, atof).
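For the concrete seven-line layout from the question, a sketch combining getline with strtof/strtol/strtod (rather than atoi/atof, so that parsing could be checked) might look like this. The file name and buffer sizes are assumptions, and getline return-value checks are omitted for brevity:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("input.txt", "r"); /* assumed file name */
    if (fp == NULL)
        exit(EXIT_FAILURE);

    char *line = NULL;
    size_t cap = 0;

    /* line 1: float */
    getline(&line, &cap, fp);
    float value1 = strtof(line, NULL);

    /* lines 2-4: file names; strip the newline and copy */
    char AFileName[256], BFileName[256], CFileName[256];
    char *names[] = { AFileName, BFileName, CFileName };
    for (int i = 0; i < 3; i++) {
        getline(&line, &cap, fp);
        line[strcspn(line, "\n")] = '\0';
        snprintf(names[i], sizeof(AFileName), "%s", line);
    }

    /* lines 5-6: ints */
    getline(&line, &cap, fp);
    int value2 = (int) strtol(line, NULL, 10);
    getline(&line, &cap, fp);
    int lastLineLength = (int) strtol(line, NULL, 10);

    /* line 7: lastLineLength doubles separated by spaces */
    double lastLine[lastLineLength];
    getline(&line, &cap, fp);
    char *p = line;
    for (int i = 0; i < lastLineLength; i++) {
        char *end;
        lastLine[i] = strtod(p, &end);
        p = end;
    }

    printf("%g %s %s %s %d %d %g %g\n", value1, AFileName, BFileName,
           CFileName, value2, lastLineLength, lastLine[0],
           lastLine[lastLineLength - 1]);

    free(line);
    fclose(fp);
    return 0;
}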

read multiple fasta sequence using external library kseq.h

I am trying to find the fasta sequences for 5 ids/names provided by the user, from a big fasta file (containing 80000 fasta sequences), using the external header file kseq.h from http://lh3lh3.users.sourceforge.net/kseq.shtml. When I run the program in a for loop, I have to open/close the big fasta file again and again (commented in the code), which makes the computation slow. On the contrary, if I open/close it only once, outside the loop, the program stops if it encounters an entry which is not present in the big fasta file, i.e. it reaches the end of the file. Can anyone suggest how to get all the sequences without losing computation time? The code is:
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h> /* strcasecmp */
#include <time.h>
#include "ext_libraries/kseq.h"

KSEQ_INIT(gzFile, gzread)

int main(int argc, char *argv[])
{
    char gwidd_ids[100];
    kseq_t *seq;
    int i = 0, nFields = 0, row = 0, col = 0;
    int size = 1000, flag1 = 0, l = 0, index0 = 0;

    printf("Opening file %s\n", argv[1]);
    char **gi_ids = (char **)malloc(sizeof(char *) * size);
    for (i = 0; i < size; i++)
    {
        gi_ids[i] = (char *)malloc(sizeof(char) * 50);
    }

    FILE *fp_inp = fopen(argv[1], "r");
    while (fscanf(fp_inp, "%s", gwidd_ids) == 1)
    {
        printf("%s\n", gwidd_ids);
        strcpy(gi_ids[index0], gwidd_ids);
        index0++;
    }
    fclose(fp_inp);

    FILE *f0 = fopen("xxx.txt", "w");
    FILE *f1 = fopen("yyy.txt", "w");
    FILE *f2 = fopen("zzz", "w");

    FILE *instream = NULL;
    instream = fopen("fasta_seq_uniprot.txt", "r");
    gzFile fpf = gzdopen(fileno(instream), "r");
    for (col = 0; col < index0; col++)
    {
        flag1 = 0;
        // FILE *instream = NULL;
        // instream = fopen("fasta_seq_nr_uniprot.txt", "r");
        // gzFile fpf = gzdopen(fileno(instream), "r");
        kseq_t *seq = kseq_init(fpf);
        while ((kseq_read(seq)) >= 0 && flag1 == 0)
        {
            if (strcasecmp(gi_ids[col], seq->name.s) == 0)
            {
                fprintf(f1, ">%s\n", gi_ids[col]);
                fprintf(f2, ">%s\n%s\n", seq->name.s, seq->seq.s);
                flag1 = 1;
            }
        }
        if (flag1 == 0)
        {
            fprintf(f0, "%s\n", gi_ids[col]);
        }
        kseq_destroy(seq);
        // gzclose(fpf);
    }
    gzclose(fpf);
    fclose(f0);
    fclose(f1);
    fclose(f2);

    for (i = 0; i < size; i++)
    {
        free(gi_ids[i]);
    }
    free(gi_ids);
    return 0;
}
A few example entries from the input file (fasta_seq_uniprot.txt) are:
P21306
MSAWRKAGISYAAYLNVAAQAIRSSLKTELQTASVLNRSQTDAFYTQYKNGTAASEPTPITK
P38077
MLSRIVSNNATRSVMCHQAQVGILYKTNPVRTYATLKEVEMRLKSIKNIEKITKTMKIVASTRLSKAEKAKISAKKMD
-----------
-----------
The user entry file is
P37592\n
Q8IUX1\n
B3GNT2\n
Q81U58\n
P70453\n
Your problem appears a bit different than you suppose. That the program stops after trying to retrieve a sequence that is not present in the data file is a consequence of the fact that it never rewinds the input. Therefore, even for a query list containing only sequences that are present in the data file, if the requested sequence IDs are not in the same relative order as in the data file, the program will fail to find some of the sequences (it will pass them by while looking for an earlier-listed sequence, never to return).
Furthermore, I think it likely that the time savings you observe come from making only a single pass through the file, instead of a (partial) pass for each requested sequence, not so much from opening it only once. Opening and closing a file is a bit expensive, but nowhere near as expensive as reading tens or hundreds of kilobytes from it.
To answer your question directly, I think you need to take these steps:
Move the kseq_init() call (kseq_t *seq = kseq_init(fpf);) to just before the loop.
Move the kseq_destroy(seq) call to just after the loop.
Put in a call to kseq_rewind(seq) as the last statement in the loop.
That should make your program right again, but it is likely to kill pretty much all your time savings, because you will return to scanning the file from the beginning for each requested sequence.
The library you are using appears to support only sequential access. Therefore, the most efficient way to do the job both right and fast would be to invert the logic: read sequences one at a time in an outer loop, testing each one as you go to see whether it matches any of the requested ones.
Supposing that the list of requested sequences will contain only a few entries, like your example, you probably don't need to do any better testing for matches than just using an inner loop to test each requested sequence id vs. the then-current sequence. If the query lists may be a lot longer, though, then you could consider putting them in a hash table or sorting them into the same order as the data file to make it possible to test more efficiently for matches.
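A minimal sketch of that inverted loop, reusing the variables from the question (gi_ids, index0, f0, f1, f2) and assuming the simple inner-loop test is good enough for a short query list; this is an illustration under those assumptions, not tested code:

gzFile fpf = gzopen("fasta_seq_uniprot.txt", "r");
kseq_t *seq = kseq_init(fpf);
int *done = calloc(index0, sizeof(int)); /* 1 = ID already matched */
int found = 0;

/* Single pass over the data file: check each sequence against
 * every requested ID that has not been matched yet. */
while (kseq_read(seq) >= 0 && found < index0)
{
    for (col = 0; col < index0; col++)
    {
        if (!done[col] && strcasecmp(gi_ids[col], seq->name.s) == 0)
        {
            fprintf(f1, ">%s\n", gi_ids[col]);
            fprintf(f2, ">%s\n%s\n", seq->name.s, seq->seq.s);
            done[col] = 1;
            found++;
            break;
        }
    }
}

/* Anything still unmatched goes to the "not found" file. */
for (col = 0; col < index0; col++)
{
    if (!done[col])
        fprintf(f0, "%s\n", gi_ids[col]);
}

free(done);
kseq_destroy(seq);
gzclose(fpf);

Note that gzopen() replaces the fopen()/gzdopen() pair from the question; both work with zlib, but gzopen() avoids keeping a separate FILE * around.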
