How can I parse text input and convert strings to integers? - c

I have an input file containing the following data:
1 1Apple 2Orange 10Kiwi
2 30Apple 4Orange 1Kiwi
and so on. I have to read this data from the file and work on it, but I don't know how to retrieve the data. I want to store the 1 (of 1Apple) as an integer and Apple as a string.
I thought of reading the whole 1Apple as a string and then doing something with the stoi function.
Or I could read the input character by character and, if the ASCII value of a character lies between 48 and 57, combine those characters into an integer and save the rest as a string. Which approach should I use? Also, how do I check the ASCII value of a char? (Should I convert the char to int and then compare, or is there a built-in function?)

You could use the fscanf() function, but only if your input pattern is never going to change. Otherwise you should probably use fgets() and check the contents yourself to separate the number from the string, as you suggested.

There is one easy right way to do this with standard C library facilities, one rather more difficult right way, and a whole lot of wrong ways. This is the easy right way:
Read an entire line into a char[] buffer using fgets.
Extract numbers from this line using strtol or strtoul.
It is very important to understand why the easier-looking alternatives (*scanf and atoi) should never be used. You might write less code initially, but once you start thinking about how to handle even slightly malformed input, you will discover that you should have used strtol.
The "rather more difficult right way" is to use lex and yacc. They are much more complicated but also much more powerful. You shouldn't need them for this problem.
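The fgets-then-strtol approach can be sketched as follows. This is a minimal illustration, not the only way to do it: the names split_token, parse_line and struct item are hypothetical, as is the 32-byte limit on names. It assumes each line is a row number followed by tokens of the form digits-then-name, as in the question.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one "10Kiwi"-style token. */
struct item {
    int count;
    char name[32];   /* assumed upper bound on name length */
};

/* Splits a token such as "10Kiwi" into a count and a name.
 * Returns 1 on success, 0 if there are no leading digits, the number
 * does not fit in an int, or the name is empty or too long. */
int split_token(const char *tok, int *count, char *name, size_t name_size)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(tok, &end, 10);
    if (end == tok)                         /* no digits consumed */
        return 0;
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;
    if (*end == '\0' || strlen(end) >= name_size)
        return 0;
    *count = (int)v;
    strcpy(name, end);                      /* length checked above */
    return 1;
}

/* Parses one line (as read by fgets) in place, e.g. "1 1Apple 2Orange 10Kiwi".
 * Stores the leading row number in *row and the items in items[].
 * Returns the number of items, or -1 on malformed input. */
int parse_line(char *line, int *row, struct item *items, int max_items)
{
    const char *delims = " \t\r\n";
    char *tok = strtok(line, delims);
    char *end;
    long v;
    int n = 0;

    if (tok == NULL)
        return -1;
    errno = 0;
    v = strtol(tok, &end, 10);
    if (end == tok || *end != '\0' || errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;
    *row = (int)v;

    while ((tok = strtok(NULL, delims)) != NULL) {
        if (n == max_items ||
            !split_token(tok, &items[n].count, items[n].name, sizeof items[n].name))
            return -1;
        n++;
    }
    return n;
}
```

A caller would read each line with fgets(buf, sizeof buf, fp) and pass buf to parse_line. Because strtol reports where it stopped through its end pointer, a token with no digits or with nothing after the digits is rejected rather than silently read as 0.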

Related

Difficulties understanding how to take elements from a file and store them in C

I'm working on an assignment that is supposed to go over the basics of reading a file and storing the information from that file. I'm personally new to C and struggling with the lack of a "String" variable.
The file that the program is supposed to work with contains temperature values, but we are supposed to account for "corrupted data". The assignment states:
"Every input item read from the file should be treated as a stream of characters (string), you can use the function atof() to convert a string value into a floating point number (invalid data can be set to a value lower than the lowest minimum to identify it as corrupt)."
The number of elements in the file is undetermined but an example given is:
37.8, 38.a, 139.1, abc.5, 37.9, 38.8, 40.5, 39.0, 36.9, 39.8
After reading the file we're supposed to allow a user to query these individual entries, but as mentioned if the data entry contains a non-numeric value, we are supposed to state that the specific data entry is corrupted.
Overall, I understand how to functionally write a program that can fulfill those requirements. My issue is not knowing what data structure to use and/or how to store the information to be called upon later.
The closest to an actual string datatype which you find in C is a sequence of chars which is terminated by a '\0' value. That is used for most things which you'd expect to do with strings.
Storing them just requires sufficient memory, as offered by a sufficiently large array of char or by malloc().
I think the requirements of your assignment would be met by making a char array as buffer, then reading in with fgets(), making sure to not read more than fits into your array and making sure that there is a '\0' at the end.
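Then you can use atof() on the content of the array and handle corrupted input if the conversion fails. Note that atof() gives no error indication of its own (it returns 0.0 when no conversion can be performed), so I would prefer sscanf() for its better feedback via a separate return value.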
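A minimal sketch of the sscanf() variant, under some assumptions: the entries are comma-separated on a line already read with fgets(), and the names parse_temp, split_temps and the sentinel CORRUPT are hypothetical. The %n directive records how many characters were consumed, so trailing junk such as the "a" in "38.a" can be detected.

```c
#include <stdio.h>
#include <string.h>

/* Assumed sentinel: lower than any plausible temperature in the file. */
#define CORRUPT (-1000.0)

/* Converts one entry to a double. Returns CORRUPT unless the entire
 * entry (ignoring surrounding whitespace) is a valid number. */
double parse_temp(const char *s)
{
    double v;
    int used = 0;

    /* " %lf" skips leading whitespace and converts the number,
     * " %n" skips trailing whitespace and stores the characters read. */
    if (sscanf(s, " %lf %n", &v, &used) == 1 && s[used] == '\0')
        return v;
    return CORRUPT;
}

/* Splits a line such as "37.8, 38.a, 139.1" in place and stores up to
 * max converted values in out[]. Returns the number of entries stored. */
int split_temps(char *line, double *out, int max)
{
    int n = 0;
    char *tok;

    for (tok = strtok(line, ",\n"); tok != NULL && n < max; tok = strtok(NULL, ",\n"))
        out[n++] = parse_temp(tok);
    return n;
}
```

A plain array of double is enough to answer later queries: entry i is corrupted exactly when out[i] equals CORRUPT. If the number of entries is not known in advance, the array could be grown with realloc().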

Extracting the domain extension of a URL stored in a string using scanf()

I am writing code that takes a URL as a string, looks up the domain extension of the URL in an array, and returns the index if it finds a match or -1 if it does not.
For example, an input would be www.stackoverflow.com, in this case, I'd need to extract only the com part. In case of www.google.com.tr, I'd need only com again, ignoring the .tr part.
I can think of basically writing a function that'll do that just fine but I'm wondering if it is possible to do it using scanf() itself?
Using scanf here is really overkill, but you can do something like this:
char a[MAXLEN],b[MAXLEN],c[MAXLEN];
scanf("%[^.].%[^.].%[^. \n]",a,b,c);
printf("Desired part is = %s\n",c);
To be sure that the format is correct, check whether the scanf call succeeded. For example:
if( 3 != scanf("%[^.].%[^.].%[^. \n]",a,b,c)){
    fprintf(stderr,"Format must be at least sth.something.sth\n");
    exit(EXIT_FAILURE);
}
What is the other way of achieving the same thing? Use fgets to read the whole line and then split it with strtok, using "." as the delimiter. This gives you the individual parts. With fgets you can easily support different kinds of rules, instead of building them into the scanf format (which makes error cases harder to handle).
The scanf solution above considers only the first three parts of the URL; the rest are not parsed. That is rarely practical. Most of the time we have to process all the parts of the URL, and we don't know how many parts there will be. In that case you are better off using fgets/strtok as mentioned above.
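The fgets/strtok alternative might look like the sketch below. The table known_ext and the function extension_index are hypothetical (the question does not show its array), and the choice of the third dot-separated part mirrors the scanf version above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table; the real list is not given in the question. */
static const char *const known_ext[] = { "com", "org", "net" };

/* Takes a host name such as "www.google.com.tr" (modified in place),
 * picks the third dot-separated part and returns its index in
 * known_ext, or -1 if there is no third part or it is not listed. */
int extension_index(char *url)
{
    char *label = strtok(url, ".\n");
    size_t k;
    int i;

    /* Advance from the first part to the third. */
    for (i = 1; label != NULL && i < 3; i++)
        label = strtok(NULL, ".\n");
    if (label == NULL)
        return -1;

    for (k = 0; k < sizeof known_ext / sizeof known_ext[0]; k++)
        if (strcmp(label, known_ext[k]) == 0)
            return (int)k;
    return -1;
}
```

Including "\n" in the delimiter set strips the newline that fgets leaves in the buffer. To examine every part rather than only the third, the same strtok loop can simply run until it returns NULL.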

How do I fscanf data within range only, instead of saving the whole data?

Basically I have rows and columns of data. I know I can use fgets to read line by line and then tokenise each line with strtok. After that I can check whether the last 2 tokens/values are within range using atoi(), and if they are, store them in an array. However, I heard strtok is a bad way to do things, and fscanf seems a much cleaner approach. The problem with fscanf is that I would have to store all the values in arrays first and then check which values are within range. Since I have a lot of rows, I don't know how big the arrays should be, and it would waste a lot of space. Is there a way to use fscanf with if statements?
I don't know if it's a stupid question, thanks.
Whatever you waste will never exceed the length of one line. Since yours is clearly a text format, 1024 characters is a fairly typical limit for a line¹.
So, in worst case you require (several?) kilobytes of memory to parse each line. You can reuse that buffer and ignore the uninteresting values.
Of course, you can write your own parser and be more memory efficient.
UPDATE
There's also this, from the scanf documentation:
(optional) assignment-suppressing character *. If this option is present, the function does not assign the result of the conversion to any receiving argument.
¹ it may vary of course
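As a rough sketch of how the suppression character combines with a range check: the example below assumes, purely for illustration, four integer columns per row where only the last two matter. The function name keep_pair and the lo/hi bounds are hypothetical.

```c
#include <stdio.h>

/* Parses one line of four integer columns, discarding the first two
 * with %*d. Returns 1 and stores the last two values only if both were
 * read and both lie in [lo, hi]; otherwise returns 0 and stores nothing. */
int keep_pair(const char *line, int lo, int hi, int *a, int *b)
{
    int x, y;

    /* Suppressed conversions are not counted in the return value,
     * so success means exactly 2. */
    if (sscanf(line, "%*d %*d %d %d", &x, &y) != 2)
        return 0;
    if (x < lo || x > hi || y < lo || y > hi)
        return 0;
    *a = x;
    *b = y;
    return 1;
}
```

Called once per fgets line, this needs only two temporaries per row, and values are appended to the result array only when keep_pair returns 1.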

How do I accept only the numeric values from a file in C?

I have taken up a project and I would like some help. Basically it is a program to check whether some pins are connected or not on a board.
(Well, that's the simplified version. The whole thing is a circuit with a microcontroller.)
The problem is that, when a pin is connected I get a numeric value, and when it's not connected, I get no value, as in it's a blank in my table.
How can I accept these values?
I need to accept even the blank, to know that it's not connected,
plus the table contains some other non-numeric values as well.
I tried reading the file using the fscanf() function but it didn't quite work. I'm aware of only fscanf(), fread(), fgets() and fgetc() functions to read from different kinds of files.
Also, is it possible to read data from an Excel file using C?
An example of the table is:
FROM TO
1 39
2
Over here, the numbers 1 and 2 are under the column FROM and it tells which pin the first end of the connector is connected to. The numbers under TO tell us which pin the other end of the connector is connected to, and when the column is blank, it's not connected at one end.
Now what I'm trying to do is create a program to create an assembly language program for the micro controller, so I need to be able to read whether the connector is connected, and if it is then to which pin? And accordingly, I need to perform some operations. (Which I can manage by myself).
The difficulty I'm facing is reading from a specific line and reading the blank.
Read the lines using fgets() or a relative. Then use sscanf() on the line, checking to see whether there were one or two successful conversions (the return value). If there's one conversion, the second value was empty or missing; if two, then you have both numbers safely.
Note that fscanf() and relatives will read past newlines unless you're careful, so they do not provide the information you need.
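A minimal sketch of the fgets-plus-sscanf approach, assuming the FROM column is always present and only TO may be blank. The names parse_pins and report_pins and the 256-byte line buffer are hypothetical.

```c
#include <stdio.h>

/* Parses one table line. Returns 2 if both FROM and TO were read,
 * 1 if only FROM was present, and 0 for header, blank or other
 * non-numeric lines. */
int parse_pins(const char *line, int *from, int *to)
{
    int n = sscanf(line, "%d %d", from, to);

    /* sscanf returns EOF (negative) when the line ends before the
     * first conversion; treat that like "no numbers". */
    return n < 0 ? 0 : n;
}

/* Reads the table line by line so that a blank TO column is never
 * filled in from the following line. */
void report_pins(FILE *in, FILE *out)
{
    char line[256];   /* assumed to be longer than any table line */
    int from, to;

    while (fgets(line, sizeof line, in) != NULL) {
        switch (parse_pins(line, &from, &to)) {
        case 2:
            fprintf(out, "pin %d -> pin %d\n", from, to);
            break;
        case 1:
            fprintf(out, "pin %d not connected\n", from);
            break;
        default:      /* header such as "FROM TO", or an empty line */
            break;
        }
    }
}
```

If the FROM column could be blank while TO is filled, whitespace-based parsing alone cannot tell the two apart, and the column positions within the line would have to be checked instead.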
So your file looks more like this:
Col1 col2 \n
r1val1 r1val2\n
.
.
and so on. If this is the case, use fscanf() to read each line as a string (up to the \n) from the file. Then use the strtok() function to break the string into tokens. Here is a tutorial on that:
http://www.gnu.org/s/hello/manual/libc/Finding-Tokens-in-a-String.html
Hope this helps.
One more humble suggestion: if you are a newbie, work on C programming first rather than going straight to microcontrollers, as there are lots of things you might misunderstand if you don't know some of the basic concepts.
This is a common problem in C. When line boundaries carry meaning in the grammar, it's difficult to directly read the file using only the scanf()-family functions.
Just read each line with fgets(3) and then run sscanf() on one line at a time. By doing this you won't incorrectly jump ahead to read the next line's first column.
Since there are two values on a line, you can parse the first, find the next whitespace, then parse the next, checking for its absence as well. I say parse rather than scanf() because when I really want control, or have a huge volume of numbers to scan, I use calls in the strtol() family.

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick !brown ~fox $a $12a".
In my current approach I have the following problems:
What happens when the control strings falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double buffer approach work, and if so, how does one manage the current locations etc.?
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on your requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much simpler approach once we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $ and output what you've looked at up to that point. Then, knowing that you are in a different state because you saw a $, you look at more characters until you see a character that ends the current state (the ;) and perform some other action on the data that you have read in. How to handle the case where the $ is seen without a well-formatted number followed by a ; wasn't entirely clear in your question: what to do if there are a million digits before you see the ;, for instance.
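The two-state idea can be sketched in C as below. Note that this sketch swaps the circular buffer for a simpler technique: it keeps the few bytes of an unfinished escape in a small carry-over buffer between chunks, which is enough when escapes are short. The limit of 3 digits, the rule that codes above 255 are passed through unchanged, and the names struct filter, filter_feed and filter_finish are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DIGITS 3   /* assumption: codes have at most 3 digits, e.g. $126; */

/* State carried across chunks: the bytes of a possibly unfinished escape. */
struct filter {
    char pending[1 + MAX_DIGITS + 1];  /* '$', digits, terminating NUL */
    size_t npending;                   /* 0 means "not inside an escape" */
};

/* Copies the pending bytes to out unchanged and leaves the escape state. */
static size_t flush_pending(struct filter *f, char *out)
{
    size_t n = f->npending;

    memcpy(out, f->pending, n);
    f->npending = 0;
    return n;
}

/* Processes one chunk of input. out must have room for
 * len + 1 + MAX_DIGITS bytes. Returns the number of bytes written. */
size_t filter_feed(struct filter *f, const char *in, size_t len, char *out)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++) {
        char c = in[i];

        if (f->npending > 0 && c >= '0' && c <= '9' && f->npending <= MAX_DIGITS) {
            f->pending[f->npending++] = c;          /* another digit */
        } else if (f->npending > 1 && c == ';') {
            int code;

            f->pending[f->npending] = '\0';
            code = atoi(f->pending + 1);            /* skip the '$' */
            if (code <= 255) {
                out[n++] = (char)code;              /* complete escape */
                f->npending = 0;
            } else {                                /* not a byte value */
                n += flush_pending(f, out + n);
                out[n++] = c;
            }
        } else {
            /* Ordinary text, or an escape that turned out to be invalid:
             * emit whatever was held back, then handle c normally. */
            n += flush_pending(f, out + n);
            if (c == '$')
                f->pending[f->npending++] = c;
            else
                out[n++] = c;
        }
    }
    return n;
}

/* Call once at end of input to emit an escape that was never finished. */
size_t filter_finish(struct filter *f, char *out)
{
    return flush_pending(f, out);
}
```

Because the state lives in struct filter rather than in the read buffer, an escape that straddles two 1000-byte reads is handled the same way as one in the middle of a chunk. The filter should start zero-initialised, and filter_finish should be called after the last chunk.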
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext + 1)); /* skip the leading $ */ }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.
