Flex - Function that compares strings in C - c

I'm programming with flex in C, and I want to compare the strings in a file (in this case my symbol table) with yytext. If yytext matches a string in the table, the function should exit; if there is no match in the table, the function should write the string to the symbol table.
This is my function:
search (char *x){
    int c;
    int n = 0;
    char *cdn;
    while ((c = fgetc(comp)) != EOF){
        fscanf(comp, "%s", cdn);
        if (strcmp(cdn, yytext) == 0){
            n++; //if n>0 when it finishes searching the file then there's a copy on the file
        }else{}
        return 0;
    }
    if (n==0){
        fprintf(comp, "%d\t %s\n", pos++, yytext); //will write if there's no copy in the table
    }else return 0;}
The input for the function is yytext, which will contain, for example, "a".
After running this, the program doesn't write anything and needs to be closed manually (more precisely, "program.extension has stopped working").
Can someone help me with this?

First of all, putting your symbol table into a file is a very debatable design choice:
Symbols are checked very frequently, so file accesses will slow down your compiler a lot.
Symbols will be associated with a lot of grammatical information (for instance, a symbol may represent a variable with an associated type), so storing only the names will not be enough to make later stages of your compiler work.
If you store all symbol information in a file, you will have to re-read the entire file and convert each bit of information into a memory representation each time you want to access a given symbol.
This is not only inefficient, it will also force you to write tons of unnecessary and complicated code.
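To illustrate the alternative, here is a minimal sketch of a symbol table kept in memory. It is only an assumption about what such a table could look like: the names symbol_find, symbol_intern, MAX_SYMBOLS and MAX_NAME_LEN are hypothetical, and a real compiler would store more than the name (a type, a scope, and so on).

```c
#include <string.h>

#define MAX_SYMBOLS  256   /* hypothetical fixed capacity */
#define MAX_NAME_LEN 64    /* hypothetical maximum name length, including '\0' */

static char symbols[MAX_SYMBOLS][MAX_NAME_LEN];
static int symbol_count = 0;

/* Returns the index of name in the table, or -1 if it is not there. */
int symbol_find(const char *name)
{
    for (int i = 0; i < symbol_count; i++) {
        if (strcmp(symbols[i], name) == 0)
            return i;
    }
    return -1;
}

/* Returns the index of name, inserting it first if it is missing.
   Returns -1 if the table is full or the name is too long. */
int symbol_intern(const char *name)
{
    int idx = symbol_find(name);
    if (idx >= 0)
        return idx;
    if (symbol_count >= MAX_SYMBOLS || strlen(name) >= MAX_NAME_LEN)
        return -1;
    strcpy(symbols[symbol_count], name);
    return symbol_count++;
}
```

In a flex action this would be called as symbol_intern(yytext). A linear scan is enough for a small table; a hash table would be the usual choice once the number of symbols grows.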
Now for your search function.
Regardless of the current bugs, what your function does is not a search, though you need to search your file to make it work.
What your function does is create a unique list of yytext values. The "search" you're performing inside it simply makes sure an already present value is not duplicated.
The very first thing to do would be to give it a less misleading name, or modify it so that it does what its name implies.
Now for the bugs.
If for some reason you still want to use a file, I suppose you will put each name on a single line.
So why not use fgets(), which reads one whole line at a time (you only need to strip the trailing newline)?
Whatever method you are using to read each name, you will have to provide a buffer with actual storage space for the string, not just an uninitialized pointer.
If your input string is yytext, your x parameter will never be used.
Lastly, your search function (which inserts the current yytext value into an unsorted list), has no reason to return anything (except an error code if your disk gets full and you can't add new names to the list).
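If the file has to stay, the points above could be combined roughly as follows. This is a sketch under assumptions, not the asker's final code: it stores one name per line (without the position column), the function name add_unique_name is hypothetical, the file must be open for both reading and writing ("r+" or "w+"), and names are assumed to fit in 255 characters.

```c
#include <stdio.h>
#include <string.h>

/* Adds name to the table file unless it is already there.
   Returns 1 if the name was added, 0 if it was already present,
   and -1 on an I/O error. */
int add_unique_name(FILE *table, const char *name)
{
    char line[256];               /* real storage, not an uninitialized pointer */

    rewind(table);
    while (fgets(line, sizeof line, table) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (strcmp(line, name) == 0)
            return 0;
    }
    if (ferror(table))
        return -1;

    /* A seek is required between reading and writing on the same stream. */
    if (fseek(table, 0, SEEK_END) != 0)
        return -1;
    if (fprintf(table, "%s\n", name) < 0)
        return -1;
    return fflush(table) == 0 ? 1 : -1;
}
```

The name to store is passed as a parameter instead of being read from the global yytext, so the call from a flex action would be add_unique_name(comp, yytext).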

Related

Difficulties understanding how to take elements from a file and store them in C

I'm working on an assignment that is supposed to go over the basics of reading a file and storing the information from that file. I'm personally new to C and struggling with the lack of a "String" variable.
The file that the program is supposed to work with contains temperature values, but we are supposed to account for "corrupted data". The assignment states:
"Every input item read from the file should be treated as a stream of characters (string), you can
use the function atof() to convert a string value into a floating point number (invalid data can be
set to a value lower than the lowest minimum to identify it as corrupt)."
The number of elements in the file is undetermined but an example given is:
37.8, 38.a, 139.1, abc.5, 37.9, 38.8, 40.5, 39.0, 36.9, 39.8
After reading the file we're supposed to allow a user to query these individual entries, but as mentioned if the data entry contains a non-numeric value, we are supposed to state that the specific data entry is corrupted.
Overall, I understand how to functionally write a program that can fulfill those requirements. My issue is not knowing what data structure to use and/or how to store the information to be called upon later.
The closest to an actual string datatype which you find in C is a sequence of chars which is terminated by a '\0' value. That is used for most things which you'd expect to do with strings.
Storing them requires just sufficient memory, as offered by a sufficiently large array of char, or by malloc().
I think the requirements of your assignment would be met by making a char array as a buffer, then reading into it with fgets(), making sure not to read more than fits into your array and making sure that there is a '\0' at the end.
Then you can use atof() on the content of the array and handle the corrupted input if the conversion fails. Note that atof() itself gives no reliable indication of failure, so I would prefer sscanf() for its better feedback via a separate return value.
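A minimal sketch of the sscanf() approach, assuming the entries are separated by commas and whitespace as in the example. The names parse_reading, parse_line and the sentinel CORRUPT are hypothetical; the sentinel value is just an assumption for "lower than the lowest minimum".

```c
#include <stdio.h>
#include <string.h>

#define CORRUPT (-1000.0)   /* hypothetical sentinel marking a corrupted entry */

/* Converts one entry. Returns CORRUPT unless the whole token is a number. */
double parse_reading(const char *token)
{
    double value;
    int used = 0;

    /* %n stores how many characters were consumed, so trailing junk
       such as the 'a' in "38.a" can be detected. */
    if (sscanf(token, "%lf%n", &value, &used) != 1 || token[used] != '\0')
        return CORRUPT;
    return value;
}

/* Splits line in place and stores up to max readings in out.
   Returns the number of entries stored. */
int parse_line(char *line, double *out, int max)
{
    const char *seps = ", \t\r\n";
    int n = 0;

    for (char *tok = strtok(line, seps); tok != NULL && n < max;
         tok = strtok(NULL, seps))
        out[n++] = parse_reading(tok);
    return n;
}
```

The line itself would come from fgets() into a char array. The array of double is then the data structure to query later: an entry equal to CORRUPT is reported as corrupted.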

C - how does sprintf() work?

I'm new to coding in C and was writing a program which needs to create custom filenames that increment from 0. To make the filenames, I use sprintf() to concatenate the counter and file extension like so:
int main(void)
{
    // do stuff
    int count = 0;
    while (condition == true)
    {
        char filename[7];
        sprintf(filename, "0%02d.txt", count); // count goes up to a max of 50
        count++;
        // check condition
    }
    return 0;
}
However, every time
sprintf(filename, "0%02d.txt", count);
runs, count gets reset to 0.
My question is, what does sprintf() do with count? Why does count change after being passed to sprintf()?
Any help would be much appreciated.
EDIT: Sorry, I haven't been too clear with the code in my question - I'm writing the program for an exercise on an online course, and count goes up to a max of 50. I've now changed my code to reflect that. Also, thanks for telling me about %04d, I was using a complicated if statement to determine how many zeroes to add to my filename to make it 3-digit.
Despite the title of the question, this has nothing to do with sprintf(), which probably works as expected, but everything to do with count.
If count is a global variable (i.e. outside any functions), then it should keep its value between function calls. So that is probably not the case.
If it is a local variable (declared inside the function), then it can have any value, since these lose their value when the function ends and don't get initialized when the function is run again. It can be always 0, but under different circumstances, it can just as well be something else. In other words, the value is more or less undetermined.
To have a local variable keep its value between function calls, make it static.
static int count = 0;
But note that when you stop and run the program again, it will start as 0 again. That means you would possibly overwrite 000.txt, then 001.txt, etc.
If you really want to avoid duplicate file names, you will have to be more sophisticated, and see which files are already there, determine the highest number, and increment that by one. So you don't use a variable, you check the files that already exist. That is far more work, but the only reliable way to avoid overwriting existing files with such numbered file names.
FWIW, I would use something like "00%04d.txt" as format string, so you get files 000000.txt, 000001.txt, etc. which look better in an alphabetically sorted file listing than 000.txt, 001.txt, 0010.txt, 0011.txt, 002.txt, etc. They are also easier to parse for their number.
As Weather Vane noticed, be sure to make your buffer a little larger, e.g.
char filename[20];
A buffer that is too small is a problem. One that is too large is not, unless it is huge and clobbers up the stack. That risk is very small with 20 chars.
I think it is likely the sprintf. The output of "0%02d.txt" (for example 000.txt) is 7 chars, which fills filename[7] completely. The terminating null then goes in the next location, which is likely count on the stack. On a little-endian machine (x86), that likely means the bottom byte of count gets zeroed out on every sprintf() call.
As other folks said, make the filename buffer larger.
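A small sketch of the safer variant, using snprintf() so that a buffer that is too small is reported instead of silently overwriting a neighbouring variable. The helper name make_filename is hypothetical; the format string is the one from the question.

```c
#include <stdio.h>

/* Writes the numbered file name into buf.
   Returns 0 on success, -1 if the name (plus '\0') did not fit. */
int make_filename(char *buf, size_t size, int count)
{
    int n = snprintf(buf, size, "0%02d.txt", count);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

With char filename[20], a call such as make_filename(filename, sizeof filename, count) succeeds for every count up to 50, while the original 7-byte buffer makes it return -1 because there is no room for the terminating null.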

Difficulty in reading in a DNA sequence file using C code

I am currently taking the Coursera Bioinformatics course and doing the programming assignments. I am a first-year undergraduate just learning C. While I am aware that Python is becoming much more popular for bioinformatics, I am challenging myself to implement every single algorithm in the course in C to master it, as it will also benefit me in all my CS courses here, which use C/C++ a lot.
In one of the assignments, our goal is to write a program that takes a shorter DNA pattern, compares it with a long complete DNA strand, and outputs the count, which is the number of times the shorter DNA pattern appears in the long complete DNA strand. We are given a file consisting of all the inputs we need and the gold output, but I am having great trouble parsing the file properly, even though I've consulted my textbook and numerous pieces of documentation. I actually have no problem implementing the algorithm itself; I tested it using smaller hardcoded character arrays in the program itself.
The input file is as follows:
Input
TAACAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTGAGCCTTTGAGCCTTTTAGCCTTTCAGCCTTTAGCCTTTAAGCCTTTCCGCATCGAGCCTTTCAGCCTTTCGTAGCCTTTCGAGCCTTTAGCCTTTCAGCCTTTAGAGCCTTTAAGCCTTTAGTCGATGTAGCCTTTAGCCTTTAGCCTTTAGCCTTTGCAGCCTTTAGTAGGCAAGCCTTTTAGCCTTTGAGCCTTTCGAGCCTTTCTCGCTAGCCTTTAGCCTTTGGTGAGCCTTTTAGCCTTTAGCCTTTTCGCAGCCTTTTGAGCCTTTCTTGTTTGAATGGCAAGAGCCTTTTCGAGCCTTTAGCCTTTGAGCCTTTCAGCCTTTAAAAGCCTTTCGTTAGCCTTTAGCCTTTATCGAGCCTTTAAGCCTTTTATGCAAAGCCTTTGAGCCTTTAGCCTTTCAGCCTTTCAGCCTTTCATTGACAAGCCTTTCAGCCTTTAGCCTTTAGCCTTTCTCAGCCTTTGAGCCTTTGAGCCTTTGTCGAGCCTTTTTTCAGAGCCTTTTAGCCTTTAGCCTTTGAGCCTTTAGCCTTTAGCCTTTTACGAGCCTTTGCAAGCCTTTCAGCCTTTCCAAGCCTTTAGCCTTTGCTTTAGCCTTTCATGGGATAGCCTTTAGCCTTTATTAAGCCTTTTTTATCAAGCCTTTGTAGCCTTTAAGCCTTTCCAGCCTTTAGGAGCCTTTGTATAGCCTTTTGAGCCTTTCTACAGTAAAGCCTTTTTTGGTCAGCCTTTCTAGCCTTTGATAGCCTTTCTGAAGCCTTTGGCGGAGCCTTTCTGTTAACAGCCCAGCCTTTCTCATAGCCTTTGCGGTATCAGCCTTTGCAGCCTTTCTGGAGCGATAGCCTTTCAGCCTTTCCGAGCCTTTTTCAGAGCCTTTGAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTCCGAGCCTTTTCAGCCTTTACAGCCTTTTTAGCCTTTGAGCCTTTCACAGCCTTTGAGCTAGCCTTTAAGTTAAAGCCTTTAGCCTTTCAGCCTTTACATTAGCCTTTTAGCCTTTTCAGCCTTTAGAGCCTTTAGCCTTTGCTGAAGCCTTTAGTAGCCTTTAGCCTTTGCGAAGCCTTTCTGGTGCAACAAGTGAAGCCTTTGCCCTAGCCTTTGCTAGCCTTTCCGAGCCTTTGTCGATATAGCCTTTAGCCTTTAGAAAGCCTTTAGCCTTTGCTAGCCTTTATAGCCTTTAGCCTTTAGCCTTTCCCAGCCTTTAGCCTTTATCCTAAGCCTTTAGCCTTTTCCAGAAGCAGCCTTTTGATCAGAGCCTTTCTTCGGACTGCTCCCAGCCTTTAGCCTTTCAGCCTTTAGCCTTTAGCCTTTCAGCCTTTAGCCTTTTCTAGCCTTTAGCCTTTGTGTAGCCTTTCTAGCCTTTACGAGCCTTTGCCCAGCCTTTCCCAGCCTTTGAGAGCCTTTACCATATAGCCTTTACATAAGCCTTTGATGAGCCTTTAGCCTTTCGAGCCTTTCAGCCTTTACTCCAGCCTTTATAGCCTTTATATAGCCTTTCCTGTTAGGCCGTCGGTGCAGCCTTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTTAGCCTTTCAGCCTTTGCTAGCCTTTCCAGCCTTTACAGCCTTTTACCGAGCCTTTCCATCAGCCTTTAGCCTTTATATCTCTGATCGGGTAGCCTTTCGGCTAGCCTTTGGTAGCCTTTTTCAGCCTTTATGTAAAGCCTTTGTAGCCTTTGATGTGAGCCTTTAGAAGCCTTTGTCAGCCTTTTAGCCTTTAGCCTTTAGCCTTTTTAAGCCTTTTACAAGCCTTTACTCAGAGCCTTTACGAGCCTTTTAGCCTTTGCAGCCTTTTAGCCTTTTGATGAGCCTTTGCGGAGCCTTTTCTTTCAGCCTTTTGGCAGCCTTTAGCCTTTGTACAGCCTTTTTGAGCACAGCCT
TTCGCGAAAGAGCCTTTATCAGCCTTTAAGCCTTTGCTAGCCTTTAGCCTTTTGAGCCTTTAGCCTTTGTATCTGTCTATCATCGAGCCTTTCTAAGCCTTTGCGGAAGCCTTTAGCCTTTGTCAGCCTTTCAAAGCCTTTAGCCTTTTATTCAAGCCTTTGAACCATAGCCTTTGGCAGCCTTTCAAGCCTTTGACGACAGCCTTTAGCCTTTCATTAGCCTTTTAGGAGGCTCATCCGTCTAGCCTTTAAATAGCCTTTAGCCTTTATAGCCTTTAAGCCTTTAGCCTTTTAAGCCTTTGAGAGCCTTTAAAGCCTTTAAACCAAGCCTTTGCGAAGCCTTTAGCCTTTAGCCTTTCGCAGCCTTTAGCCTTTCGAGCCTTTTGTAGCCTTTTGGAGAGCCTTTGGGCAAGCCTTTAGTATAAGCCTTTAGCCTTTTAGCCTTTCAAGCCTTTAGCCTTTAAGCCTTTGGCACAGCCTTTTGAGCCTTTCAGGAGCCTTTATTGGTGAGCCTTTAGTATAGCCTTTTCAGCCTTTGGCAGCCTTTAATGAAAGCCTTTGCTCAGCCTTTTTCAGCCTTTACAGCCTTTCAAGCCTTTAGCCACAAGCCTTTAGCCTTTAGCCTTTCAGCCTTTGCGAGCCTTTGTAGCCTTTAAGCCTTTCAGCCTTTAGCCTTTAGCCTTTTTCAAGCCTTTTCAGCCTTTCAAAGCCTTTCAAGCCTTTGAAGCCTTTCTAAGCCTTTGAGCCTTTGAGCCTTTGAGCCTTTAGCCTTTGTTCCTAGCCTTTATAGCCTTTTAGGCAGCCTTTCAGAGCCTTTTAAGCCTTTAGCCTTTCAGAAAGAGCCTTTAGCCCAGCCTTTTGATTAGCCTTTAGGGAACAGCCTTTAGCCTTTTAAGCCTTTGGTATACAATCAACGCAGCCTTTAGCCTTTAAGCCTTTTGGAGCCTTTCAGACTGATCCCAGCCTTTCAGCCTTTCTCAGCCTTTAAGCCTTTCTCCAAGCCTTTTGAGCCTTTTCGAGCCTTTAGTGAGCCTTTTGAAGCCTTTGTTTAGCCTTTTGTATAGGGTAGCCTTTAGCCTTTCCGGAAGCCTTTTGTAGCCTTTAAGCCTTTTGTCCGGGAAAGCCTTTGTAAGCCTTTAATGCAGCCTTTCCTATAGCCTTTAAGCCTTTCAGCCTTTTGGAGCCTTTTCTCAGCCTTTAGCCTTTCGCCAGCCTTTCTCCCGAGCAGCCTTTTAGAAAAAGCCTTTTAGCCTTTTACCGTGGACAGCCTTTCACGAGCCTTTACAGGCTAGCCTTTAGCCTTTGCTAGCCTTTTCCCAGCCTTTTGAGCCTTTAAGCCTTTCTAAGTTCTACGCTTGGGCTAAAGCCTTTAGCCTTTAAGCCTTTCAGCCTTTTGCAGCCTTTATATAACTTGAGCCTTTAGCCTTTAGCCTTTATAGCCTTTAGCCTTTTAGCCTTTTATATCCCTTAAGCCTTTGTAAGCCTTTAGCCTTTAAGCCTTTACGAGGAAAGCCTTTCATGCAGCCTTTAGCCTTTAGCCTTTGAGCCTTTCCAGCCTTTCAGCCTTTCAGCCTTTAGCCTTTAGCCTTTAGCCTTTTAGCCTTTATGAGCCTTTATAGCCTTTAGCCTTTTCACCAGCCTTTCCAGATGCACAAGCCTTTCAGCCTTTAGCCTTTCGAGCCTTTGGCTTATAGCCTTTCATCAGCCTTTCTAGCCTTTTAGCCTTTAGCCTTTAGCCTTTTCTAGCCTTTCAGCCTTTAGCCTTTTCGAAGCCTTTAGCCTTTTTAGCCTTTAGCTCAGCCTTTAGCCTTTATCTAACAGCCTTTAGCCTTTAGCCTTTAAAGCCTTTATGTCCAATTCTAACAGCCTTTAGCCTTTAAAGCCTTTGCAGCCTTTGAGCCTTTTAGCCTTTGAAGCCTTTAGCCTTTGTCAGCCTTTCCAGCCTTTTAGCCTTTAGCAGCCTTTAGTACGCCAGCCTTTAGCCTTTGTATAAGCCTTTA
GCCTTTAGCCTTTCCACTAGCCTTTAGCCTTTAGAGGAGCGATAGCCTTTCAGCCTTTAGAAAGCCTTTGTTGCTGCTAGCCTTTGGGTTCTCAGCCTTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTTGTAGCCTTTTACATAGGATTGATTCAAAAGCCTTTTTGAGCCTTTCTGCATTAGCCTTTTCCTCTAGCCTTTAGCCTTTCGCAGCCTTTAGCCTTTTAGAGCCTTTAGATAGCCTTTCGCGACAGCCTTTTGTTTAGCCTTTAGCCTTTGTTAGCCTTTGAGCCTTTGAGCCTTTTAGCCTTTCCTAGCCTTTCAGCCTTTCCAAAGCCTTTGACAGGGTGTAGCCTTTCTAGCCTTTTTAGCCTTTAGCCTTTAAACTTAAGCCTTTTTAGCCTTTAGCCTTTCAACCCAGCCTTTAGCCTTTTAAGCCTTTAGCCTTTAGCCTTTTTAGAAGCCTTTTAGCCTTTAGCCTTTGGAGCCTTTCAGATCTCAGCCTTTTCGAGCCTTTTAGCCTTTTCAGAAAAGTAGCCTTTTTAGCAGCCTTTTAAAGCCTTTGGAGCCTTTAGCCTTTAGCCTTTGTAGCCTTTTCCCAAAAGCCTTTACAGCCTTTGTGAGCCTTTTAGTTCGTTTGAGCCTTTCCAGCCTTTCAGCCTTTAGCCTTTATAGCCTTTTGCGAGAAGCCTTTAAGCCTTTAGCCTTTTGACGTTCTAGAGCCTTTGGAGCCTTTCACGCGAGCCTTTCAAGCCTTTGACTCCGCAGCCTTTTCGCGACCAGCCTTTGCCGTGCCAGCCTTTAGCCTTTCAACACAGCCTTTAGCCTTTGGGCCGCAGAGCCTTTGAGTAGCCTTTAGCCTTTGACAGCCTTTAGCCTTTCTAGCCTTTGCAGCCTTTGTCTAGGTAGCCTTTAGCCTTTAGCCTTTCTAGCCTTTTAGCCTTTAGCCTTTTGAGCCTTTTGGAAGCCTTTCAGCCTTTAGCCTTTCGCGAGCCTTTGAGCCTTTACCCAGCCTTTACGGAGCCTTTAGCCTTTCCCATAGCCTTTAGCCTTTCCAGCCTTTAGCCTTTTAGCCTTTCAAATCTAAGCCTTTCGCATATATGGTAGCCTTTAGCCTTTAGCCTTTATGGTCCTTCAGTTTGAGCCTTTTAGAGCCTTTAAAGGAGCCTTTGTAAGACGAAGGTAGCCTTTAGCCTTTGCCAGCCTTTTTAGCCTTTAGCCTTTAAAAAGCCTTTGAGCCTTTAGCCTTTAGCCTTTGAGCCTTTAGCCTTTTCTCCTAGCCTTTCATAGCCTTTGAGCCTTTAGCCTTTTAGCCTTTTAGCCTTTAGCCTTTAGCCTTTGGAGGTCAGCCTTTATGTTAAAGCCTTTAGTTCCCAGCCTTTCAGCCTTTAGCCTTTAGCCTTTGAGCCTTTCAGCCTTTTAGCCTTTCAGCCTTTCAGCCTTTGAAGCCTTTTGTAGCCTTTGCCCGAGCCTTTAGCCTTTAGCCTTTCCCAACCCTGATCCGTAGCCTTTGGGCTGATCCTGAGCCTTTTCAGCCTTTAAGCCTTTAGCCTTTAGCCTTTGAGAAGCCTTTAGCCTTTCAGCCTTTAACAGCCTTTAAGCCTTTATAGCCTTTAGCCAGCCTTTGCAGCCTTTCAGTAGCCTTTAGCCTTTAGCCTTTCTAGCCTTTCTTGGAGCCTTTCCCAGCCTTTAAGAGCCTTTAGCCTTTTAGCCTTTCAGCCTTTAGCCTTTTCGTAGCCTTTGACCATTGTCAGCCTTTCTACTGAGCCTTTCATAGCCTTTTTTAGCCTTTCTAGCAGCCTTTGGAGCCTTTAGAAGAGCCTTTAGCCTTTTAAGCCTTTGAGCCTTTAACACAAGCCTTTATCTGGGCCGCGAGCCTTTTCAACCTAACTACAGCCTTTCTAAGCCTTTAGCCTTTAGCCTTTCAGCCTTTTAGCCTTTACCGAGCCTTTGCGGGAAGCCTTTAAAGAGCCTTTAGAAAAAGCCTTTGGG
ATAGCCTTTCCAGCCTTTCCAGCCTTTTTAGCCTTTTCCTCAAGATTTAGCCTTTGATGAAGCCTTTGAGCCTTTAGCCTTTCATTGAGCCTTTTAAGCCTTTCAGCCTTTTCTCATCAGCCTTTCACAGCCTTTCTACAGCCTTTAGCCTTTAGCCTTTGGAGCCTTTTCGCCCCGAGCCTTTAGCCTTTAGCCTTTTAGCCTTTCAGCCTTTGTAGCCTTTAGAGCCTTTGCTTAGCCTTTAGCCTTTAGTAGCCTTTAGATAGCCTTTTCTGGGAGCCTTTACAGCCTTTAGCCTTTAGCCTTTAGCCTTTTAAAGCCTTTCCCCAAAGCCTTTGTTGAGCCTTTAGCCTTTACAGTCTAGCCTTTAGCCTTTCAAGCCTTTACCTTAGCCTTTGGCAGCCTTTCTAGCCTTTAGCCTTTTCAGCCTTTAGCCTTTAAGCCTTTAGCCTTTTCGAGCCTTTGAGCCTTTAAGCCTTTATAAAAAGCCTTTAGCCTTTAAGCCTTTACCAGCCTTTAGCCTTTCAGCCTTTTATCGGAAAGCCTTTAAGCCTTTTAGCCTTTCAGCCTTTGAGCCTTTCAGCCTTTAGCCTTTGGCAAAGCCTTTTTGCAGCCTTTGGAAGCCTTTAGCCTTTTTCAAGCCTTTCAGCCTTTAGCCTTTGCACGTATTAGGAAGCCTTTTACTCTAAGCCTTTATCAGCCTTTAGCCTTTAGCCTTTAAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTACGGTCAGCCTTTGGTAGCCTTTTCAGCCTTTAAGCCTTTAAGCCTTTGAGCCTTTAGCCTTTAGCCTTTGAGCCTTTAAGAGCCTTTCAGCCTTTTTTAGCCTTTTAGCCTTTGAGCCTTTCCTAGCCTTTCAAGCCTTTGAGCCTTTCGAAGCCTTTTAGCCTTTAGCCTTTAGCCTTTATGGAGCCTTTAGCCTTTAGCGGAGCCTTTGAGCCTTTACAGAGCCTTTAGCCTTTAGCCTTTTAAGCCTTTTGCAGCCTTTCAAAGAGCCTTTAGCCTTTACGGAGCCTTTAGCCTTTAAGCCTTTCTCACTAGCCTTTTTAGCCTTTGAGCCTTTATGACGAAGCCTTTAGCCTTTTGTCGTGACCTGAGCCTTTAGCCTTTACAGCCTTTCAGCCTTTAGCCTTTCTTAAAAGCCTTTTAGCCTTTTTGAGCCTTTACAGCCTTTCGAGCCTTTGAGCCTTTCCCAGCCTTTGAAGCCTTTTGGACAGAGCCTTTGCTAGCCTTTAGCCTTTTAGCCTTTAGCCTTTAGCCTTTACTTAGCCTTTTAGCCTTTATGGATAGCCTTTAGCCTTTGAGAGCCTTTGCCTAGCCTTTGAAGCCTTTTTAGCCTTTAACGAGCCTTTAGCCTTTAGCCTTTAGCCTTTAAGCCTTTAGCCTTTCGAGCCTTTCTCAGCCTTTGTAGCCTTTAGCCTTTAGAGCAGCCTTTAGCCTTTCCAGCCTTTAGCCTTTTCAGCCTTTAGCCTTTCAGCCTTTGCCCCGAGCACGTAGCCTTTACAGCCTTTAGCCTTTAGCCTTTTAGCCTTTACAGCCTTTTGAGCCTTTAGCCTTTGAAAGCCTTTTGAAGAGCCTTTCAGCCTTTCTTACTAGCCTTTGCAGCCTTTTAGCCTTTCCGAGCCTTTGATAGCCTTTGTCGGTAAGCCTTTGTAGAGCCTTTAGCCTTTAAGCCTTTGGTAAAGAGCCTTTTCAACAGCCTTTCGGAGCCTTTCGCTACAAGCCTTTTGGCCTAGCCTTTAGCCTTTCAGCCTTTCAAGAGCCTTTAGCCTTTCGCAGCCTTTATAGCCTTTCAGCCTTTCAGCCTTTAGCCTTTAGAGCCTTTGAGCCTTTCGTTATCTAAGCCTTTACTCCATAGCCTTTGAGCCTTTAGCCTTTGTCAGTCGAGCCTTTGTTCTTGAGCCTTTAGCCTTTGCAGCCTTTAGCCTTTTGTTTGTGGAGCCTTTAGC
CTTTGAATACAGCCTTTAGCCTTTAGCCTTTAGCCTTTCTAGCCTTTCAGCAGCCTTTGTAGCCTTTGAACCAGCCTTTAGCCTTTTAGCCTTTTCCTTAGCCTTTCCAGCCTTTTAGTGAGCCTTTAGCCTTTGCACCAGCCTTTAGCCTTTAGCCTTTCAGCCTTTAGCCTTTCGAGCCTTTTAGCCTTTGAACAGCCTTTTGAGCCTTTGACGATATGAGCCTTTAGCCTTTTGTAGCCTTTTTTAGCCTTTGAACAGCCTTTGGAGTCAAGCCTTTACGCAGCCTTTCCAGCCTTTCAGCCTTTAGCCTTTGGTCAGCCTTTTCAGAGCCTTTGCGGTTAGCCTTTGAATAGCCTTTAAAGCCTTTCTCAGCCTTTGTAAGCCTTTAGCCTTTTAGCCTTTGTGAGCCTTTCAGCCTTTCCGAGCCTTTAGCCTTTGCCTACGGAAGCCTTTAGCCTTTGCTATCAGCTTGAGCCTTTTAGCCTTTAGTAGCAGCCTTTTAGCCTTTTAGCCTTTCAGCCTTTCTCTAGCCTTTAGCCTTTATCCGAGCCTTTACCAGCCTTTGAGCCTTTAGCCTTTATAGCCTTTATACGTAGCTAGCCTTTAGCCTTTAGAGCCTTTACCCTGTACCAGCCTTTAAGCCTTTCTCGTGAAGCCTTTAGCCTTTGAGCCTTTCGAGCCTTTAGCCTTTAGCCTTTAAGCCTTTTTGTGTGAGCCTTTAGCCTTTGGGGAGCCTTTAGCCTTTCAGCCTTTTAGCCTTTTCAAGCCTTTAGCCTTTAGCCTTTTGAGCCTTTAAAGCCTTTAGCCTTTAGGTAGCAAGCCTTTCGTTATAGCCTTTTATAAGCCTTTTTTAATGAGCCTTTAGCCTTTAGCCTTTGAGCAGCCTTTAGCCTTTAGTAGCCTTTTGATATTAGCCTTTCAGCCTTTAGCCTTTCCCCGAGCCTTTGTTAGAGCCTTTGCAGCCTTTGGAGCCTTTAGCCTTTCGGAGCCTTTAGCCTTTGGGACAGCCTTTAGCCTTTAGCCTTTGAAGCCTTTTGCAGCCTTTAAGATAGCCTTTGAGCCTTTTCAGCCTTTACAGCCTTTAAGCCTTTAGCCTTTGAGCCTTTGAGCCTTTTGAGCCTTTTAGCCTTTGTTGCAGCCTTTAGCCTTTAGCCTTTTAGCCTTTAGCCTTTAGCCTTTGAGCCTTTGAGCCTTTTAGCCTTTAGCCTTTGAGCCTTTTGGACAGCCTTTCTGAGCCTTTCGTAGCCTTTACCGCAAGCCTTTATAGCCTTTGAAGAGGAGCCTTTATAGCCTTTCAGAAGCCTTTTAAGCCTTTTCGCAGCCTTTTATCAGCCTTTAGCCTTTAGCCTTTTAGCCTTTCAGCCTTTAGCCTTTACAAGCCTTTAGCCTTTAGCCTTTATCAAGCCTTTCTAGCCTTTGAGCCTTTGTGAGCCTTTGTGTCAGCCTTTCAAGCCTTTTTAAGTACAGCCTTTACTCAGCCTTTATAGCCTTTGTCGTAAGCCTTTAGCCTTTAGCCTTTGAAAAGCCTTTACGCACAGACAAGTAGCCTTTCAGCCTTTAAGCCTTTGAGTATGTCCTTGAGCCTTTAAAAGAGCCTTTGGTAGCCTTTAGCCTTTAGCCTTTTATAGCCTTTAAGCCTTTAAGCCTTT
AGCCTTTAG
Output
294
The first line is the string "Input".
The second line is the entire DNA sequence (which, as the label "Input" implies, is my input DNA sequence, and which I will refer to throughout as text).
The third line is the shorter DNA sequence, which I will refer to as pattern.
The fourth line is the string "Output".
The fifth line is the gold output, the count which my program should return.
I tried parsing the file with the following code:
int main(int argc, char **argv)
{
    if(argc>1)
    {
        FILE * dataset = fopen(argv[1], "r");
        if(dataset==NULL)
        {
            printf("File could not be opened or found!\n");
            return 1;
        }
        char in_label[1000], dna_text[10000], dna_pattern[1000], out_label[1000];
        int count=0;
        fscanf(dataset, "%s, %s, %s, %s, %d", in_label, dna_text,dna_pattern, out_label,&count);
        ... and other code below that calls the counting algorithm which I won't show here ...
While my call to fscanf does return me in_label correctly, it does not work for the remaining arguments. Basically when I printf out each of my in_label, dna_text, dna_pattern, out_label and count, only in_label correctly gives me the string Input, but the rest all are garbage. I'm really confused, because I thought that the fscanf function automatically skips linefeeds or spaces when reading in from the stream. So why did Input get correctly read into in_label, but not the others???
Also a second question I have is about one shortcoming that I'm aware of in my program, which are the hardcoded array sizes. I know about malloc function, and just learnt about it this week in class, but I just can't figure out how to use it here. Because in order to use malloc, we need to be able to at least "soft code" the size of our array in advance, and here, I just can't imagine how I would be able to tell the compiler, in any "soft coded" manner, what my array sizes will be especially for the dna_text array, which varies greatly from dataset to dataset.
C is really challenging, a world away from python whose convenience I've been so spoilt by. I would greatly appreciate any help to overcome this issue, so that I can move on with my learning of bioinformatics. Thank you very much!
You can use fstat() to get the file size and malloc() to allocate buffers of the proper size.
Use fgets() for reading text files line by line. Do not use fscanf() at all.
Never read a whole file in one call: if the reading API returns an error, you cannot tell what exactly went wrong. In my experience, reading line by line is the best strategy when working with text files. Just be sure you have a buffer large enough to store the longest possible line.
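As a side note on the original call: the commas in the fscanf() format string must match literal commas in the input, and the file has none, so matching stops after the first %s. The advice above could be sketched as follows, assuming a POSIX system (fstat() and fileno() are not part of standard C), a file smaller than INT_MAX bytes, and the five-line layout described in the question. The names struct dataset, read_line, copy_string and load_dataset are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() and fstat() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical container for the parsed dataset. */
struct dataset {
    char *text;      /* long DNA strand, freed by the caller */
    char *pattern;   /* short pattern, freed by the caller */
    long expected;   /* gold count from the last line */
};

/* Reads one line and strips the line ending. Returns 1 on success,
   0 on end of file or error. */
static int read_line(FILE *fp, char *buf, size_t cap)
{
    if (fgets(buf, (int)cap, fp) == NULL)
        return 0;
    buf[strcspn(buf, "\r\n")] = '\0';
    return 1;
}

/* Returns a heap copy of s, or NULL if malloc() fails. */
static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/* Loads label, text, pattern, label, count. Returns 0 on success, -1 on failure. */
int load_dataset(FILE *fp, struct dataset *out)
{
    struct stat st;
    if (fstat(fileno(fp), &st) != 0)
        return -1;

    /* No line can be longer than the whole file, so size + 1 always fits. */
    size_t cap = (size_t)st.st_size + 1;
    char *line = malloc(cap);
    if (line == NULL)
        return -1;

    int ok = 0;
    out->text = out->pattern = NULL;
    if (read_line(fp, line, cap)                                  /* "Input" */
        && read_line(fp, line, cap)
        && (out->text = copy_string(line)) != NULL
        && read_line(fp, line, cap)
        && (out->pattern = copy_string(line)) != NULL
        && read_line(fp, line, cap)                               /* "Output" */
        && read_line(fp, line, cap)) {
        out->expected = strtol(line, NULL, 10);
        ok = 1;
    }
    if (!ok) {
        free(out->text);
        free(out->pattern);
        out->text = out->pattern = NULL;
    }
    free(line);
    return ok ? 0 : -1;
}
```

Sizing the line buffer from the file size answers the "soft coded" question: the size is only known at run time, which is exactly what malloc() is for. The buffer is larger than strictly needed, but it can never be too small.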

Finding a string in a file [C]

Could anyone tell me how to find a string (which you enter in a program) in a .txt file without using a library function for that? (I just need an algorithm, nothing else.) Example: I have a file named NAMES.txt with names on the first line, separated by spaces, like this:
John Peter Paul
and in my program I enter a name, for example Paul, and it finds it in that file and writes "the name is there".
name = Paul;
I have one method in mind: if I enter, for example, Paul, the program would scan the characters in the file one by one, and when it finds a character equal to name[0] (P), it would start comparing the following letters. Each time the letters match, it would increase a counter p by one (p++), and if p equals the length of the name, then the name is there. (One bug comes to mind: if you enter Paul and the file contains the name Paula, this method will still write "The name is there", but that should not be impossible to fix.)
Could anyone also tell me whether the method I described is feasible?
I suggest avoiding reading the entire file into memory. Large files might result in large memory consumption, which is far from ideal.
Presumably you have the string to search for in memory somewhere; it's already in an allocation. Create another allocation of the size of that string, and read that many bytes into it... Don't forget to account for the '\0' string terminator.
Check to see if it matches. If the string matches, well, obviously you've found a match within that file. If it doesn't, shift the array one byte left, read another byte onto the end of it. Rinse, lather, repeat until you find a match.
The bug you mentioned implies that you need a string terminator, in the file, somewhere. Technically speaking, a string terminator is a '\0', but you could substitute any terminal value(s). Just replace the value(s) you choose (perhaps whitespace?) with a '\0' as you're reading.
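The sliding-window idea above can be sketched like this. It is one possible reading of the description, not the only one: instead of replacing terminators with '\0', it checks that the characters just before and just after the window are whitespace (or the start or end of the file), which handles the Paul/Paula case. The function name file_contains_word is hypothetical.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if word occurs in fp as a whole whitespace-delimited word,
   0 otherwise. Only strlen(word) bytes of the file are held in memory. */
int file_contains_word(FILE *fp, const char *word)
{
    size_t len = strlen(word);
    if (len == 0)
        return 0;

    char *window = malloc(len);
    if (window == NULL)
        return 0;

    size_t filled = 0;
    int prev = ' ';     /* character before the window; start of file counts as a space */
    int found = 0;
    int c;

    while ((c = fgetc(fp)) != EOF) {
        if (filled < len) {
            window[filled++] = (char)c;       /* still filling the first window */
            continue;
        }
        /* The window is full and c is the character right after it. */
        if (memcmp(window, word, len) == 0 && isspace(prev) && isspace(c)) {
            found = 1;
            break;
        }
        /* Shift one byte left and append the new byte at the end. */
        prev = (unsigned char)window[0];
        memmove(window, window + 1, len - 1);
        window[len - 1] = (char)c;
    }

    /* The last window ends at the end of the file, which also counts as a boundary. */
    if (!found && filled == len && memcmp(window, word, len) == 0 && isspace(prev))
        found = 1;

    free(window);
    return found;
}
```

memcmp() and memmove() are used for brevity; if library functions are off limits, both are short loops over len bytes.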
The function fgets() will be useful for this task, as you can store a whole line in a buffer instead of reading one character at a time.
fgets(name,100,fp)
Here name is the buffer where the line is stored, 100 is the size of that buffer (at most 99 characters are read, plus the terminating '\0'), and fp is the FILE pointer you want to read from.
Then you can use strcmp() to compare each name in the line with the name you want to search for. Since strcmp() compares whole strings, it eliminates the possibility of matching a different, longer name.
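A minimal sketch of this approach, with one added assumption: since the file holds several names per line, the line is first split on whitespace with strtok() before each piece is compared with strcmp(). The function name name_in_file is hypothetical, and a line longer than 99 characters would be read in pieces, which could cut a name in two.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if name appears in fp as a whole whitespace-separated word,
   0 otherwise. */
int name_in_file(FILE *fp, const char *name)
{
    const char *seps = " \t\r\n";
    char line[100];

    while (fgets(line, sizeof line, fp) != NULL) {
        for (char *tok = strtok(line, seps); tok != NULL;
             tok = strtok(NULL, seps)) {
            if (strcmp(tok, name) == 0)   /* whole-word comparison */
                return 1;
        }
    }
    return 0;
}
```

Searching for Paul in a file that only contains Paula returns 0, because strcmp() also compares the terminating '\0'.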

Writing and Reading to and from a binary file C language

I was given an assignment to manage a sort of music store. In the database (which is saved as a .dat file) we have an artists name, and the album.
I'm having problems writing and reading the file.
First thing is, even if I don't write anything and just create the file, when I open it in Notepad I see gibberish and Chinese or Japanese characters.
Even if I write to the file, or read from it using Visual Studio, this doesn't seem to change. Here's my code:
I opened the file with:
p=fopen("database.dat","w+");
The add item function:
void add_item(char* artist,char* record,FILE* p) //adds an item with artist and record to store
{
    item node;
    int item_size=sizeof(item);
    rewind(p);
    strcpy(node.artist,artist);
    strcpy(node.record,record);
    fwrite(&node,item_size,1,p);
    printf("Data added\n");
}
item is the struct that defines a single item in the store. It has two fields, the strings artist and record.
typedef struct item
{
    char artist[100],record[100];
}item;
This is for reading:
void print_file(FILE* p) //print the entire file
{
    int size=sizeof(item);
    item node;
    rewind(p);
    while(!feof(p))
    {
        fread(&node,size,1,p);
        printf("%s - %s\n",node.artist,node.record);
    }
}
If I use print_file, I see gibberish; if I actually open the file with Notepad, I see Japanese.
Help! :D
Edit: Just discovered something. If I add an item and then read the file, I will read the item. But if I run the program again and try to read the file immediately, I see gibberish.
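The problem is with "w+": it truncates an existing file to zero length and opens it for writing and reading. That is what happens when you run the program the second time: the file is empty, so fread() reads nothing and the uninitialized node is printed as gibberish. To keep the existing contents, open the file with "r+" instead, which does not truncate it.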
I'm guessing based on your symptoms that the strings are defined as char* and not char[], which would be your problem. String IO on files doesn't work that way. Keep in mind that a string is actually of type char*, that is, a pointer to one or more 8-bit integers. When you write the string to the file, you're actually writing the value of the address itself, not the characters it points to.
You should use the fprintf() function to write strings to files:
http://www.cplusplus.com/reference/cstdio/fprintf/
And then fscanf() to read them:
http://www.cplusplus.com/reference/cstdio/fscanf/
In general, if you're going to use a struct as a file format, you can't put pointers in the struct. You could put char[] values, because they aren't handled the same as pointers, but that would require a hard limit on string size. This is one of the reasons why structs as file formats are discouraged by some-- better to read and write the values one at a time, handling strings and so on appropriately.
The reason it works when you read the file back immediately (before quitting your program) is because that pointer address is still valid for that string. But the string itself never got written into the file.
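Putting the points from this thread together, a corrected version could look roughly like the sketch below. It keeps the struct with char arrays from the question and makes a few assumptions of its own: the helper open_database is hypothetical, records are appended at the end of the file (the original rewind() before fwrite() overwrote the first record each time), and the read loop tests the return value of fread() rather than feof(), which otherwise processes one failed read.

```c
#include <stdio.h>
#include <string.h>

typedef struct item {
    char artist[100], record[100];
} item;

/* Opens the database for reading and writing without truncating it.
   The file is created only if it does not exist yet. */
FILE *open_database(const char *path)
{
    FILE *p = fopen(path, "r+b");
    if (p == NULL)
        p = fopen(path, "w+b");
    return p;
}

/* Appends one record at the end of the file. Returns 1 on success, 0 on failure. */
int add_item(const char *artist, const char *record, FILE *p)
{
    item node;

    memset(&node, 0, sizeof node);    /* no uninitialized bytes reach the file */
    strncpy(node.artist, artist, sizeof node.artist - 1);
    strncpy(node.record, record, sizeof node.record - 1);
    if (fseek(p, 0, SEEK_END) != 0)
        return 0;
    return fwrite(&node, sizeof node, 1, p) == 1;
}

/* Prints every record and returns how many were read. */
int print_file(FILE *p)
{
    item node;
    int count = 0;

    rewind(p);
    while (fread(&node, sizeof node, 1, p) == 1) {
        printf("%s - %s\n", node.artist, node.record);
        count++;
    }
    return count;
}
```

The file still looks like gibberish in Notepad, which is expected: each record is 200 raw bytes, mostly '\0' padding, not a line of text.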
