Getting and processing user input across multiple lines - c

I'm trying to get multiple lines of input from a user via stdin (although eventually, I'd like to be able to specify a file). The idea is that the user specifies inputs within matching "<" and ">". I'd like them to be able to invoke the program and then type as many of these inputs as they'd like across multiple lines until they terminate the input with Control-D.
So they could do:
"[this is a valid
input even thought it spans
three lines]" (using brackets instead of < so that my text doesn't disappear!)
I discovered I could use fscanf to easily process these kinds of fragment by using something like:
fscanf(fp, "<" "%999[^>]" ">", buffer);
But I'm having a lot of difficulty getting this to work across multiple lines, and I'm not entirely sure about the best way to loop through these inputs and put the strings between < into an array containing just the relevant string.
I've done a bit of research and people seem to have differing opinions on the use of fgets versus sscanf versus fscanf, and I'm not really sure about what the merits of each are as they related to my particular problem. I'm also not sure how to make a newline not terminate the input (as it currently does). Should I be checking the number of matches that fscanf returns, or should I be looking for an EOF terminator?
Currently, my code looks like this (I was using matches earlier to check the number of each input but have since removed that). Obviously this doesn't yet move anything to an array for the sake of simplicity. Additionally, fp is currently stdin, but I'd like to keep it robust enough that I could simply change that pointer to a file in case I wanted to read from a file.
char buffer[1000];
while(true){
int matches = fscanf(fp, "{" "%999[^}]" "}", buffer);
if (feof(fp)) break;
printf("%s\n", buffer);
}

Related

Extracting the domain extension of a URL stored in a string using scanf()

I am writing a code that takes a URL address as a string literal as input, then runs the domain extension of the URL through an array and returns the index if finds a match, -1 if does not.
For example, an input would be www.stackoverflow.com, in this case, I'd need to extract only the com part. In case of www.google.com.tr, I'd need only com again, ignoring the .tr part.
I can think of basically writing a function that'll do that just fine but I'm wondering if it is possible to do it using scanf() itself?
It's really an overhead to use scanf here. But you can do this to realize something similar
char a[MAXLEN],b[MAXLEN],c[MAXLEN];
scanf("%[^.].%[^.].%[^. \n]",a,b,c);
printf("Desired part is = %s\n",c);
To be sure that formatting is correct you can check whether this scanf call is successful or not. For example:
if( 3 != scanf("%[^.].%[^.].%[^. \n]",a,b,c)){
fprintf(stderr,"Format must be atleast sth.something.sth\n");
exit(EXIT_FAILURE);
}
What is the other way of achieving this same thing. Use fgets to read the whole line and then parse with strtok with delimiters ".". This way you will get parts of it. With fgets you can easily support different kind of rules. Instead of incorporating it in scanf (which will be a bit difficult in error case), you can use fgets,strtok to do the same.
With the solution provided above only the first three parts of the url is being considered. Rest are not parsed. But this is hardly the practical situation. Most the time we have to process the whole information, all the parts of the url (and we don't know how many parts can be there). Then you would be better using fgets/strtok as mentioned above.

Difficulty in reading in a DNA sequence file using C code

I am currently taking Coursera Bioinformatics course, and doing the programming assignments. Being, a first year undergraduate just learning C, while I am aware that python is the language that's becoming much more popular for bioinformatics, I am challenging myself to implement every single algorithm in the course in the C language to master it, as it will also benefit me in all my CS courses here, which use C/C++ a lot.
In working on one of the assignments, our goal is to write a program that can take in a shorter DNA pattern and compare it with a long complete DNA strand, and the output the count, which is the number of times the shorter DNA pattern appears in the long complete DNA strand. We are given a file consisting of all the inputs we need, and the gold output, but I am having great trouble in parsing the file properly, even though I've consulted my textbook and numerous documentation. I actually have no problem implementing the algorithm itself; I tested it using smaller hardcoded character arrays in the program itself.
The input file is as follows:
Input
TAACAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTGAGCCTTTGAGCCTTTTAGCCTTTCAGCCTTTAGCCTTTAAGCCTTTCCGCATCGAGCCTTTCAGCCTTTCGTAGCCTTTCGAGCCTTTAGCCTTTCAGCCTTTAGAGCCTTTAAGCCTTTAGTCGATGTAGCCTTTAGCCTTTAGCCTTTAGCCTTTGCAGCCTTTAGTAGGCAAGCCTTTTAGCCTTTGAGCCTTTCGAGCCTTTCTCGCTAGCCTTTAGCCTTTGGTGAGCCTTTTAGCCTTTAGCCTTTTCGCAGCCTTTTGAGCCTTTCTTGTTTGAATGGCAAGAGCCTTTTCGAGCCTTTAGCCTTTGAGCCTTTCAGCCTTTAAAAGCCTTTCGTTAGCCTTTAGCCTTTATCGAGCCTTTAAGCCTTTTATGCAAAGCCTTTGAGCCTTTAGCCTTTCAGCCTTTCAGCCTTTCATTGACAAGCCTTTCAGCCTTTAGCCTTTAGCCTTTCTCAGCCTTTGAGCCTTTGAGCCTTTGTCGAGCCTTTTTTCAGAGCCTTTTAGCCTTTAGCCTTTGAGCCTTTAGCCTTTAGCCTTTTACGAGCCTTTGCAAGCCTTTCAGCCTTTCCAAGCCTTTAGCCTTTGCTTTAGCCTTTCATGGGATAGCCTTTAGCCTTTATTAAGCCTTTTTTATCAAGCCTTTGTAGCCTTTAAGCCTTTCCAGCCTTTAGGAGCCTTTGTATAGCCTTTTGAGCCTTTCTACAGTAAAGCCTTTTTTGGTCAGCCTTTCTAGCCTTTGATAGCCTTTCTGAAGCCTTTGGCGGAGCCTTTCTGTTAACAGCCCAGCCTTTCTCATAGCCTTTGCGGTATCAGCCTTTGCAGCCTTTCTGGAGCGATAGCCTTTCAGCCTTTCCGAGCCTTTTTCAGAGCCTTTGAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTCCGAGCCTTTTCAGCCTTTACAGCCTTTTTAGCCTTTGAGCCTTTCACAGCCTTTGAGCTAGCCTTTAAGTTAAAGCCTTTAGCCTTTCAGCCTTTACATTAGCCTTTTAGCCTTTTCAGCCTTTAGAGCCTTTAGCCTTTGCTGAAGCCTTTAGTAGCCTTTAGCCTTTGCGAAGCCTTTCTGGTGCAACAAGTGAAGCCTTTGCCCTAGCCTTTGCTAGCCTTTCCGAGCCTTTGTCGATATAGCCTTTAGCCTTTAGAAAGCCTTTAGCCTTTGCTAGCCTTTATAGCCTTTAGCCTTTAGCCTTTCCCAGCCTTTAGCCTTTATCCTAAGCCTTTAGCCTTTTCCAGAAGCAGCCTTTTGATCAGAGCCTTTCTTCGGACTGCTCCCAGCCTTTAGCCTTTCAGCCTTTAGCCTTTAGCCTTTCAGCCTTTAGCCTTTTCTAGCCTTTAGCCTTTGTGTAGCCTTTCTAGCCTTTACGAGCCTTTGCCCAGCCTTTCCCAGCCTTTGAGAGCCTTTACCATATAGCCTTTACATAAGCCTTTGATGAGCCTTTAGCCTTTCGAGCCTTTCAGCCTTTACTCCAGCCTTTATAGCCTTTATATAGCCTTTCCTGTTAGGCCGTCGGTGCAGCCTTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTTAGCCTTTCAGCCTTTGCTAGCCTTTCCAGCCTTTACAGCCTTTTACCGAGCCTTTCCATCAGCCTTTAGCCTTTATATCTCTGATCGGGTAGCCTTTCGGCTAGCCTTTGGTAGCCTTTTTCAGCCTTTATGTAAAGCCTTTGTAGCCTTTGATGTGAGCCTTTAGAAGCCTTTGTCAGCCTTTTAGCCTTTAGCCTTTAGCCTTTTTAAGCCTTTTACAAGCCTTTACTCAGAGCCTTTACGAGCCTTTTAGCCTTTGCAGCCTTTTAGCCTTTTGATGAGCCTTTGCGGAGCCTTTTCTTTCAGCCTTTTGGCAGCCTTTAGCCTTTGTACAGCCTTTTTGAGCACAGCCTTTCGCGAAAGAGCCTTTATCAGCCTTTAAGCCTTTGCTAGCCTTTAGCCTTTTGAGCCTTTAGCCTTTGTATCTGTCTATCATCGAGCCTTTCTAAGCCTTTGCGGAAGCCTTTAGCCTTTGTCAGCCTTTCAAAGCCTTTAGCCTTTTATTCAAGCCTTTGAACCATAGCCTTTGGCAGCCTTTCAAGCCTTTGACGACAGCCTTTAGCCTTTCATTAGCCTTTTAGGAGGCTCATCCGTCTAGCCTTTAAATAGCCTTTAGCCTTTATAGCCTTTAAGCCTTTAGCCTTTTAAGCCTTTGAGAGCCTTTAAAGCCTTTAAACCAAGCCTTTGCGAAGCCTTTAGCCTTTAGCCTTTCGCAGCCTTTAGCCTTTCGAGCCTTTTGTAGCCTTTTGGAGAGCCTTTGGGCAAGCCTTTAGTATAAGCCTTTAGCCTTTTAGCCTTTCAAGCCTTTAGCCTTTAAGCCTTTGGCACAGCCTTTTGAGCCTTTCAGGAGCCTTTATTGGTGAGCCTTTAGTATAGCCTTTTCAGCCTTTGGCAGCCTTTAATGAAAGCCTTTGCTCAGCCTTTTTCAGCCTTTACAGCCTTTCAAGCCTTTAGCCACAAGCCTTTAGCCTTTAGCCTTTCAGCCTTTGCGAGCCTTTGTAGCCTTTAAGCCTTTCAGCCTTTAGCCTTTAGCCTTTTTCAAGCCTTTTCAGCCTTTCAAAGCCTTTCAAGCCTTTGAAGCCTTTCTAAGCCTTTGAGCCTTTGAGCCTTTGAGCCTTTAGCCTTTGTTCCTAGCCTTTATAGCCTTTTAGGCAGCCTTTCAGAGCCTTTTAAGCCTTTAGCCTTTCAGAAAGAGCCTTTAGCCCAGCCTTTTGATTAGCCTTTAGGGAACAGCCTTTAGCCTTTTAAGCCTTTGGTATACAATCAACGCAGCCTTTAGCCTTTAAGCCTTTTGGAGCCTTTCAGACTGATCCCAGCCTTTCAGCCTTTCTCAGCCTTTAAGCCTTTCTCCAAGCCTTTTGAGCCTTTTCGAGCCTTTAGTGAGCCTTTTGAAGCCTTTGTTTAGCCTTTTGTATAGGGTAGCCTTTAGCCTTTCCGGAAGCCTTTTGTAGCCTTTAAGCCTTTTGTCCGGGAAAGCCTTTGTAAGCCTTTAATGCAGCCTTTCCTATAGCCTTTAAGCCTTTCAGCCTTTTGGAGCCTTTTCTCAGCCTTTAGCCTTTCGCCAGCCTTTCTCCCGAGCAGCCTTTTAGAAAAAGCCTTTTAGCCTTTTACCGTGGACAGCCTTTCACGAGCCTTTACAGGCTAGCCTTTAGCCTTTGCTAGCCTTTTCCCAGCCTTTTGAGCCTTTAAGCCTTTCTAAGTTCTACGCTTGGGCTAAAGCCTTTAGCCTTTAAGCCTTTCAGCCTTTTGCAGCCTTTATATAACTTGAGCCTTTAGCCTTTAGCCTTTATAGCCTTTAGCCTTTTAGCCTTTTATATCCCTTAAGCCTTTGTAAGCCTTTAGCCTTTAAGCCTTTACGAGGAAAGCCTTTCATGCAGCCTTTAGCCTTTAGCCTTTGAGCCTTTCCAGCCTTTCAGCCTTTCAGCCTTTAGCCTTTAGCCTTTAGCCTTTTAGCCTTTATGAGCCTTTATAGCCTTTAGCCTTTTCACCAGCCTTTCCAGATGCACAAGCCTTTCAGCCTTTAGCCTTTCGAGCCTTTGGCTTATAGCCTTTCATCAGCCTTTCTAGCCTTTTAGCCTTTAGCCTTTAGCCTTTTCTAGCCTTTCAGCCTTTAGCCTTTTCGAAGCCTTTAGCCTTTTTAGCCTTTAGCTCAGCCTTTAGCCTTTATCTAACAGCCTTTAGCCTTTAGCCTTTAAAGCCTTTATGTCCAATTCTAACAGCCTTTAGCCTTTAAAGCCTTTGCAGCCTTTGAGCCTTTTAGCCTTTGAAGCCTTTAGCCTTTGTCAGCCTTTCCAGCCTTTTAGCCTTTAGCAGCCTTTAGTACGCCAGCCTTTAGCCTTTGTATAAGCCTTTAGCCTTTAGCCTTTCCACTAGCCTTTAGCCTTTAGAGGAGCGATAGCCTTTCAGCCTTTAGAAAGCCTTTGTTGCTGCTAGCCTTTGGGTTCTCAGCCTTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTTGTAGCCTTTTACATAGGATTGATTCAAAAGCCTTTTTGAGCCTTTCTGCATTAGCCTTTTCCTCTAGCCTTTAGCCTTTCGCAGCCTTTAGCCTTTTAGAGCCTTTAGATAGCCTTTCGCGACAGCCTTTTGTTTAGCCTTTAGCCTTTGTTAGCCTTTGAGCCTTTGAGCCTTTTAGCCTTTCCTAGCCTTTCAGCCTTTCCAAAGCCTTTGACAGGGTGTAGCCTTTCTAGCCTTTTTAGCCTTTAGCCTTTAAACTTAAGCCTTTTTAGCCTTTAGCCTTTCAACCCAGCCTTTAGCCTTTTAAGCCTTTAGCCTTTAGCCTTTTTAGAAGCCTTTTAGCCTTTAGCCTTTGGAGCCTTTCAGATCTCAGCCTTTTCGAGCCTTTTAGCCTTTTCAGAAAAGTAGCCTTTTTAGCAGCCTTTTAAAGCCTTTGGAGCCTTTAGCCTTTAGCCTTTGTAGCCTTTTCCCAAAAGCCTTTACAGCCTTTGTGAGCCTTTTAGTTCGTTTGAGCCTTTCCAGCCTTTCAGCCTTTAGCCTTTATAGCCTTTTGCGAGAAGCCTTTAAGCCTTTAGCCTTTTGACGTTCTAGAGCCTTTGGAGCCTTTCACGCGAGCCTTTCAAGCCTTTGACTCCGCAGCCTTTTCGCGACCAGCCTTTGCCGTGCCAGCCTTTAGCCTTTCAACACAGCCTTTAGCCTTTGGGCCGCAGAGCCTTTGAGTAGCCTTTAGCCTTTGACAGCCTTTAGCCTTTCTAGCCTTTGCAGCCTTTGTCTAGGTAGCCTTTAGCCTTTAGCCTTTCTAGCCTTTTAGCCTTTAGCCTTTTGAGCCTTTTGGAAGCCTTTCAGCCTTTAGCCTTTCGCGAGCCTTTGAGCCTTTACCCAGCCTTTACGGAGCCTTTAGCCTTTCCCATAGCCTTTAGCCTTTCCAGCCTTTAGCCTTTTAGCCTTTCAAATCTAAGCCTTTCGCATATATGGTAGCCTTTAGCCTTTAGCCTTTATGGTCCTTCAGTTTGAGCCTTTTAGAGCCTTTAAAGGAGCCTTTGTAAGACGAAGGTAGCCTTTAGCCTTTGCCAGCCTTTTTAGCCTTTAGCCTTTAAAAAGCCTTTGAGCCTTTAGCCTTTAGCCTTTGAGCCTTTAGCCTTTTCTCCTAGCCTTTCATAGCCTTTGAGCCTTTAGCCTTTTAGCCTTTTAGCCTTTAGCCTTTAGCCTTTGGAGGTCAGCCTTTATGTTAAAGCCTTTAGTTCCCAGCCTTTCAGCCTTTAGCCTTTAGCCTTTGAGCCTTTCAGCCTTTTAGCCTTTCAGCCTTTCAGCCTTTGAAGCCTTTTGTAGCCTTTGCCCGAGCCTTTAGCCTTTAGCCTTTCCCAACCCTGATCCGTAGCCTTTGGGCTGATCCTGAGCCTTTTCAGCCTTTAAGCCTTTAGCCTTTAGCCTTTGAGAAGCCTTTAGCCTTTCAGCCTTTAACAGCCTTTAAGCCTTTATAGCCTTTAGCCAGCCTTTGCAGCCTTTCAGTAGCCTTTAGCCTTTAGCCTTTCTAGCCTTTCTTGGAGCCTTTCCCAGCCTTTAAGAGCCTTTAGCCTTTTAGCCTTTCAGCCTTTAGCCTTTTCGTAGCCTTTGACCATTGTCAGCCTTTCTACTGAGCCTTTCATAGCCTTTTTTAGCCTTTCTAGCAGCCTTTGGAGCCTTTAGAAGAGCCTTTAGCCTTTTAAGCCTTTGAGCCTTTAACACAAGCCTTTATCTGGGCCGCGAGCCTTTTCAACCTAACTACAGCCTTTCTAAGCCTTTAGCCTTTAGCCTTTCAGCCTTTTAGCCTTTACCGAGCCTTTGCGGGAAGCCTTTAAAGAGCCTTTAGAAAAAGCCTTTGGGATAGCCTTTCCAGCCTTTCCAGCCTTTTTAGCCTTTTCCTCAAGATTTAGCCTTTGATGAAGCCTTTGAGCCTTTAGCCTTTCATTGAGCCTTTTAAGCCTTTCAGCCTTTTCTCATCAGCCTTTCACAGCCTTTCTACAGCCTTTAGCCTTTAGCCTTTGGAGCCTTTTCGCCCCGAGCCTTTAGCCTTTAGCCTTTTAGCCTTTCAGCCTTTGTAGCCTTTAGAGCCTTTGCTTAGCCTTTAGCCTTTAGTAGCCTTTAGATAGCCTTTTCTGGGAGCCTTTACAGCCTTTAGCCTTTAGCCTTTAGCCTTTTAAAGCCTTTCCCCAAAGCCTTTGTTGAGCCTTTAGCCTTTACAGTCTAGCCTTTAGCCTTTCAAGCCTTTACCTTAGCCTTTGGCAGCCTTTCTAGCCTTTAGCCTTTTCAGCCTTTAGCCTTTAAGCCTTTAGCCTTTTCGAGCCTTTGAGCCTTTAAGCCTTTATAAAAAGCCTTTAGCCTTTAAGCCTTTACCAGCCTTTAGCCTTTCAGCCTTTTATCGGAAAGCCTTTAAGCCTTTTAGCCTTTCAGCCTTTGAGCCTTTCAGCCTTTAGCCTTTGGCAAAGCCTTTTTGCAGCCTTTGGAAGCCTTTAGCCTTTTTCAAGCCTTTCAGCCTTTAGCCTTTGCACGTATTAGGAAGCCTTTTACTCTAAGCCTTTATCAGCCTTTAGCCTTTAGCCTTTAAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTAGCCTTTACGGTCAGCCTTTGGTAGCCTTTTCAGCCTTTAAGCCTTTAAGCCTTTGAGCCTTTAGCCTTTAGCCTTTGAGCCTTTAAGAGCCTTTCAGCCTTTTTTAGCCTTTTAGCCTTTGAGCCTTTCCTAGCCTTTCAAGCCTTTGAGCCTTTCGAAGCCTTTTAGCCTTTAGCCTTTAGCCTTTATGGAGCCTTTAGCCTTTAGCGGAGCCTTTGAGCCTTTACAGAGCCTTTAGCCTTTAGCCTTTTAAGCCTTTTGCAGCCTTTCAAAGAGCCTTTAGCCTTTACGGAGCCTTTAGCCTTTAAGCCTTTCTCACTAGCCTTTTTAGCCTTTGAGCCTTTATGACGAAGCCTTTAGCCTTTTGTCGTGACCTGAGCCTTTAGCCTTTACAGCCTTTCAGCCTTTAGCCTTTCTTAAAAGCCTTTTAGCCTTTTTGAGCCTTTACAGCCTTTCGAGCCTTTGAGCCTTTCCCAGCCTTTGAAGCCTTTTGGACAGAGCCTTTGCTAGCCTTTAGCCTTTTAGCCTTTAGCCTTTAGCCTTTACTTAGCCTTTTAGCCTTTATGGATAGCCTTTAGCCTTTGAGAGCCTTTGCCTAGCCTTTGAAGCCTTTTTAGCCTTTAACGAGCCTTTAGCCTTTAGCCTTTAGCCTTTAAGCCTTTAGCCTTTCGAGCCTTTCTCAGCCTTTGTAGCCTTTAGCCTTTAGAGCAGCCTTTAGCCTTTCCAGCCTTTAGCCTTTTCAGCCTTTAGCCTTTCAGCCTTTGCCCCGAGCACGTAGCCTTTACAGCCTTTAGCCTTTAGCCTTTTAGCCTTTACAGCCTTTTGAGCCTTTAGCCTTTGAAAGCCTTTTGAAGAGCCTTTCAGCCTTTCTTACTAGCCTTTGCAGCCTTTTAGCCTTTCCGAGCCTTTGATAGCCTTTGTCGGTAAGCCTTTGTAGAGCCTTTAGCCTTTAAGCCTTTGGTAAAGAGCCTTTTCAACAGCCTTTCGGAGCCTTTCGCTACAAGCCTTTTGGCCTAGCCTTTAGCCTTTCAGCCTTTCAAGAGCCTTTAGCCTTTCGCAGCCTTTATAGCCTTTCAGCCTTTCAGCCTTTAGCCTTTAGAGCCTTTGAGCCTTTCGTTATCTAAGCCTTTACTCCATAGCCTTTGAGCCTTTAGCCTTTGTCAGTCGAGCCTTTGTTCTTGAGCCTTTAGCCTTTGCAGCCTTTAGCCTTTTGTTTGTGGAGCCTTTAGCCTTTGAATACAGCCTTTAGCCTTTAGCCTTTAGCCTTTCTAGCCTTTCAGCAGCCTTTGTAGCCTTTGAACCAGCCTTTAGCCTTTTAGCCTTTTCCTTAGCCTTTCCAGCCTTTTAGTGAGCCTTTAGCCTTTGCACCAGCCTTTAGCCTTTAGCCTTTCAGCCTTTAGCCTTTCGAGCCTTTTAGCCTTTGAACAGCCTTTTGAGCCTTTGACGATATGAGCCTTTAGCCTTTTGTAGCCTTTTTTAGCCTTTGAACAGCCTTTGGAGTCAAGCCTTTACGCAGCCTTTCCAGCCTTTCAGCCTTTAGCCTTTGGTCAGCCTTTTCAGAGCCTTTGCGGTTAGCCTTTGAATAGCCTTTAAAGCCTTTCTCAGCCTTTGTAAGCCTTTAGCCTTTTAGCCTTTGTGAGCCTTTCAGCCTTTCCGAGCCTTTAGCCTTTGCCTACGGAAGCCTTTAGCCTTTGCTATCAGCTTGAGCCTTTTAGCCTTTAGTAGCAGCCTTTTAGCCTTTTAGCCTTTCAGCCTTTCTCTAGCCTTTAGCCTTTATCCGAGCCTTTACCAGCCTTTGAGCCTTTAGCCTTTATAGCCTTTATACGTAGCTAGCCTTTAGCCTTTAGAGCCTTTACCCTGTACCAGCCTTTAAGCCTTTCTCGTGAAGCCTTTAGCCTTTGAGCCTTTCGAGCCTTTAGCCTTTAGCCTTTAAGCCTTTTTGTGTGAGCCTTTAGCCTTTGGGGAGCCTTTAGCCTTTCAGCCTTTTAGCCTTTTCAAGCCTTTAGCCTTTAGCCTTTTGAGCCTTTAAAGCCTTTAGCCTTTAGGTAGCAAGCCTTTCGTTATAGCCTTTTATAAGCCTTTTTTAATGAGCCTTTAGCCTTTAGCCTTTGAGCAGCCTTTAGCCTTTAGTAGCCTTTTGATATTAGCCTTTCAGCCTTTAGCCTTTCCCCGAGCCTTTGTTAGAGCCTTTGCAGCCTTTGGAGCCTTTAGCCTTTCGGAGCCTTTAGCCTTTGGGACAGCCTTTAGCCTTTAGCCTTTGAAGCCTTTTGCAGCCTTTAAGATAGCCTTTGAGCCTTTTCAGCCTTTACAGCCTTTAAGCCTTTAGCCTTTGAGCCTTTGAGCCTTTTGAGCCTTTTAGCCTTTGTTGCAGCCTTTAGCCTTTAGCCTTTTAGCCTTTAGCCTTTAGCCTTTGAGCCTTTGAGCCTTTTAGCCTTTAGCCTTTGAGCCTTTTGGACAGCCTTTCTGAGCCTTTCGTAGCCTTTACCGCAAGCCTTTATAGCCTTTGAAGAGGAGCCTTTATAGCCTTTCAGAAGCCTTTTAAGCCTTTTCGCAGCCTTTTATCAGCCTTTAGCCTTTAGCCTTTTAGCCTTTCAGCCTTTAGCCTTTACAAGCCTTTAGCCTTTAGCCTTTATCAAGCCTTTCTAGCCTTTGAGCCTTTGTGAGCCTTTGTGTCAGCCTTTCAAGCCTTTTTAAGTACAGCCTTTACTCAGCCTTTATAGCCTTTGTCGTAAGCCTTTAGCCTTTAGCCTTTGAAAAGCCTTTACGCACAGACAAGTAGCCTTTCAGCCTTTAAGCCTTTGAGTATGTCCTTGAGCCTTTAAAAGAGCCTTTGGTAGCCTTTAGCCTTTAGCCTTTTATAGCCTTTAAGCCTTTAAGCCTTT
AGCCTTTAG
Output
294
First line is the string "Input"
Second line is entire DNA sequence (which as the label "Input" implies, is my input DNA sequence, and which I will refer to it throughout as text)
Third line the shorter DNA sequence, which I will refer to as pattern.
Fourth line is string "Output"
Fifth line is the gold output, the number of counts which my program should be returning.
I tried parsing the file with the following code:
int main(int argc, char **argv)
{
if(argc>1)
{
FILE * dataset = fopen(argv[1], "r");
if(dataset==NULL)
{
printf("File count not be opened or found!\n");
return 1;
}
char in_label[1000], dna_text[10000], dna_pattern[1000], out_label[1000];
int count=0;
fscanf(dataset, "%s, %s, %s, %s, %d", in_label, dna_text,dna_pattern, out_label,&count);
... and other code below that calls the counting algorithm which I won't show here ...
While my call to fscanf does return me in_label correctly, it does not work for the remaining arguments. Basically when I printf out each of my in_label, dna_text, dna_pattern, out_label and count, only in_label correctly gives me the string Input, but the rest all are garbage. I'm really confused, because I thought that the fscanf function automatically skips linefeeds or spaces when reading in from the stream. So why did Input get correctly read into in_label, but not the others???
Also a second question I have is about one shortcoming that I'm aware of in my program, which are the hardcoded array sizes. I know about malloc function, and just learnt about it this week in class, but I just can't figure out how to use it here. Because in order to use malloc, we need to be able to at least "soft code" the size of our array in advance, and here, I just can't imagine how I would be able to tell the compiler, in any "soft coded" manner, what my array sizes will be especially for the dna_text array, which varies greatly from dataset to dataset.
C is really challenging, a world away from python whose convenience I've been so spoilt by. I would greatly appreciate any help to overcome this issue, so that I can move on with my learning of bioinformatics. Thank you very much!
You can use fstat() to get the file size and malloc() for allocating proper buffers
Use fgets() for reading text files line by line. Do not use fscanf() at all.
Never read a file at once as you should never know what exactly went wrong if your reading API returned an error. My experience tells me that reading line-by-line is the best strategy when working with text files. Just be sure you have a buffer large enough for storing the longest possible line.

sscanf returns 1 despite the line not matching the search pattern

I need to read certain parameters from a file. My problem is to scan the lines of the file to find the parameters. The text file is structured in lines like:
character *\n
So each line has to have the pattern (space or tab)character[space or tab][char][space or tab]\n.
Spaces or tabs at the beginning are optional. I tried to do it with
char val;
if(sscanf(buf, "%*[ \t]character%*[ \t]%c%*[ \t]\n",&val)==1||sscanf(buf, "character%*[ \t]%c%*[ \t]\n",&val)==1){
printf("%c in %i\n", val,line);
}else{
fprintf(stderr,"Error while reading line %i\n",line);
}
buf contains the current line.
My problem is, in lines like character \n my program does not print an error. Instead it saves '\n' in val. I do not understand this behavior, because this line does not match my search pattern.
What is wrong with my code?
My understanding of my
I do not understand this behavior, because this line does not match my search pattern.
The *scanf functions do not check the pattern first and then, if it matches, fill in the values. They check one character at a time, and indicate how many of the fields in the format string they were able to use.
Unfortunately, in your case, %c can certainly match '\n'. The subsequent %*[ \t] fails, as would the subsequent \n, but since those aren't stored anywhere, they don't affect sscanf's return value, so you can't tell from the result whether there was any error.
The simplest way to solve this might be to not use *scanf functions at all. Your input format is easily described using a custom routine, but not so easily with a format string.

Reading parts of a file after a specific tag is found in C using fgets

I would like some suggestion on how to read an 'XML' like file in such a way that the program would only read/store elements observed in a node that meets some requirements. I was thinking about using two fgets in the following way:
while (fgets(file_buffer,line_buffer,fp) != NULL)
{
if (p_str = (char*) strstr(file_buffer,"<element of interest opening")) )
{
//new fgets that starts at fp and runs only until the end of the node
{
//read and process
}
}
}
Does this make sense or are there smarter ways of doing this?
Secondly (in my idea), will i have to define a new FILE* (like fr), set fr to fp at the start of the second fgets or can i somehow abuse the original filepointer for that?
Use an XML parser like Xmllib2 http://xmlsoft.org/xml.html
Your approach seems isn't bad for the job.
You could read the whole line from the file, then, process it using sprintf, strstr or whatever functions you like. This will save you time and unnecessary overheads with FILE I/O.
As per your second idea, you can use fseek()(Refer: man fseek) or rewind()(Refer: man rewind) using the same file pointer fp. You do not need an extra file pointer.
EDIT:
If you could change the tag format to adhere to XML structure you will be able to use libXML2 and such libraries properly.
If that's not possible, then you have to write your own parser.
A few pointers:
First extract data from the file into a buffer.The size of the buffer and whether dynamically or statically allocated, will depend on your specs.
Search in the buffer, if non-whitespace character is < or whatever character your tag usually begins with. If not, you can just show an error and exit.
Now follows the tag name, until the first whitespace, or the / or the > character. Store them. Process the =, strings and stuff, as you wish.
If the next non-whitespace character is /, check that it is followed by >(or a similar pattern in your specs to find if a tag ends). If so, you've finished parsing and can return your result. Otherwise, you've got a malformed tag and should exit with an error.
If the character is >, then you've found the end of the begin tag. Now follows the content.
Otherwise what follows is an argument. Parse that, store the result, continue at step 4.
Read the content until you find a < character.
If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >. If yes, return the result , else, throw an error.
If you get here, you've found the beginning of a nested XML. Parse that with this algorithm and then continue at 4 again.
Although, its quite basic idea, I hope it should help you start.
EDIT:
If you still want to reference the file as a pointer consider using mmap().
If you add mmap with a bit of shared memory IPC and adequate memory locking stuff, you could write a parallel processing program, that will process most of your files faster.

How to use sscanf correctly and safely

First of all, other questions about usage of sscanf do not answer my question because the common answer is to not use sscanf at all and use fgets or getch instead, which is impossible in my case.
The problem is my C professor wants me to use scanf in a program. It's a requirement.
However the program also must handle all the incorrect input.
The program must read an array of integers. It doesn't matter in what format the integers
for the array are supplied. To make the task easier, the program might first read the size of the array and then the integers each in a new line.
The program must handle the inputs like these (and report errors appropriately):
999999999999999...9 (numbers larger than integer)
12a3 (don't read this as an integer 12)
a...z (strings)
11 aa 22 33\n all in one line (this might be handled by discarding everything after 11)
inputs larger than the input array
There might be more incorrect cases, these are the only few I could think of.
If the erroneous input is supplied, the program must ask the user to input again until
the correct input is given, but the previous correct input must be kept (only incorrect
input must be cleared from the input stream).
Everything must conform to C99 standard.
The scanf family of function cannot be used safely, especially when dealing with integers. The first case you mentioned is particularly troublesome. The standard says this:
If this object does not have an appropriate type, or if the result of
the conversion cannot be represented in the object, the behavior is
undeļ¬ned.
Plain and simple. You might think of %5d tricks and such but you'll find they're not reliable. Or maybe someone will think of errno. The scanf functions aren't required to set errno.
Follow this fun little page: they end up ditching scanf altogether.
So go back to your C professor and ask them: how exactly does C99 mandate that sscanf will report errors ?
Well, let sscanf accept all inputs as %s (i.e. strings) and then program analyze them
If you must use scanf to accept the input, I think you start with something a bit like the following.
int array[MAX];
int i, n;
scanf("%d", &n);
for (i = 0; i < n && !feof(stdin); i++) {
scanf("%d", &array[i]);
}
This will handle (more or less) the free-format input problem since scanf will automatically skip leading whitespace when matching a %d format.
The key observation for many of the rest of your concerns is that scanf tells you how many format codes it parsed successfully. So,
int matches = scanf("%d", &array[i]);
if (matches == 0) {
/* no integer in the input stream */
}
I think this handles directly concerns (3) and (4)
By itself, this doesn't quite handle the case of the input12a3. The first time through the loop, scanf would parse '12as an integer 12, leaving the remaininga3` for the next loop. You would get an error the next time round, though. Is that good enough for your professor's purposes?
For integers larger than maxint, eg, "999999.......999", I'm not sure what you can do with straight scanf.
For inputs larger than the input array, this isn't a scanf problem per se. You just need to count how many integers you've parsed so far.
If you're allowed to use sscanf to decode strings after they've been extracted from the input stream by something like scanf("%s") you could also do something like this:
while (...) {
scanf("%s", buf);
/* use strtol or sscanf if you really have to */
}
This works for any sequence of white-space separated words, and lets you separate scanning the input for words, and then seeing if those words look like numbers or not. And, if you have to, you can use scanf variants for each part.
The problem is my C professor wants me to use scanf in a program.
It's a requirement.
However the program also must handle all the incorrect input.
This is an old question, so the OP is not in that professor's class any more (and hopefully the professor is retired), but for the record, this is a fundamentally misguided and basically impossible requirement.
Experience has shown that when it comes to interactive user input, scanf is suitable only for quick-and-dirty situations when the input can be assumed to correct.
If you want to read an integer (or a floating-point number, or a simple string) quickly and easily, then scanf is a nice tool for the job. However, its ability to gracefully handle incorrect input is basically nonexistent.
If you want to read input robustly, reliably detecting incorrect input, and perhaps warning the user and asking them to try again, scanf is simply not the right tool for the job. It's like trying to drive screws with a hammer.
See this answer for some guidelines for using scanf safely in those quick-and-dirty situations. See this question for suggestions on how to do robust input using something other than scanf.
scanf("%s", string) into long int_string = strtol(string, &end_pointer, base:10)

Resources