I have just started learning C after coding for some while in Java and Python.
I was wondering how I could "validate" a string input (check whether it meets certain criteria), and I stumbled upon the sscanf() function.
I had the impression that it acts somewhat like regular expressions, but I couldn't work out how to build more complex queries with it.
For example, let's say I have the following string:
char str[] = "Santa-monica 123";
I want to use sscanf() to check if the string has only letters, numbers and dashes in it.
Could someone please elaborate?
The fact that sscanf allows something that looks a bit like a character class by no means implies that it is anything at all like a regular expression library. In fact, Posix doesn't even require the scanf functions to accept character ranges inside character classes, although I suspect that it will work fine on any implementation you will run into.
But the scanning problem you have does not require regular expressions, either. All you need is a repeated character class match, and sscanf can certainly do that:
#include <stdbool.h>
bool check_string(const char* s) {
int n = 0;
sscanf(s, "%*[-a-zA-Z0-9]%n", &n);
return s[n] == 0;
}
The idea behind that scanf format is that the first conversion will match and discard the longest initial sequence consisting of valid characters. (It might fail if the first character is invalid, in which case n keeps its initial value of 0 and the check still works. Thanks to @chux for pointing that out.) If it succeeds, it will then set n to the current scan point, which is the offset of the next character. If the next character is a NUL, then all the characters were good. (This version returns OK for the empty string, since it contains no illegal characters. If you want the empty string to fail, change the return condition to return n && s[n] == 0;)
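As a minimal sketch of the non-empty variant just described (the name check_string_nonempty is made up for illustration, and character ranges inside a scanset are assumed to work as they do on common implementations):

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: true only if s is non-empty and consists solely of
   letters, digits and dashes. */
bool check_string_nonempty(const char *s) {
    int n = 0;
    /* %*[...] matches and discards the longest valid prefix; %n then stores
       how many characters were consumed. If the very first character is
       invalid (or s is empty), the conversion fails and n stays 0. */
    sscanf(s, "%*[-a-zA-Z0-9]%n", &n);
    return n > 0 && s[n] == '\0';
}
```

Called on "Santa-monica" this should accept the string, while "Santa-monica 123" is rejected because scanning stops at the space.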
You could also do this with the standard regex library (or any more sophisticated library, if you prefer, but the Posix library is usually available without additional work). This requires a little bit more code in order to compile the regular expression. For efficiency, the following attempts to compile the regex only once, but for simplicity I left out the synchronization to avoid data races during initialization, so don't use this in a multithreaded application.
#include <regex.h>
#include <stdbool.h>
bool check_string(const char* s) {
static regex_t* re_ptr = NULL;
static regex_t re;
if (!re_ptr) regcomp((re_ptr = &re), "^[[:alnum:]-]*$", REG_EXTENDED);
return regexec(re_ptr, s, 0, NULL, 0) == 0;
}
I want to use sscanf() to check if the string has only letters, numbers and dashes in it.
A variation on @rici's good answer.
Create a scanset for letters, numbers and dashes.
//v              The * indicates to scan, but not save the result.
//  v            Dash (or minus sign), best to list first.
"%*[-0-9A-Za-z]"
//      ^^^^^^   Letters a-z, both cases
//   ^^^         Digits
Use "%n" to detect how far the scan went.
Now we can determine whether:
Scanning stopped due to a null character (the whole string is valid)
Scanning stopped due to an invalid character
int n = 0;
sscanf(str, "%*[-0-9A-Za-z]%n", &n);
bool success = (str[n] == '\0');
sscanf does not have this functionality; the argument you are referring to is a format specifier and is not used for validation. See here: https://www.tutorialspoint.com/c_standard_library/c_function_sscanf.htm
As also mentioned, sscanf is for a different job; for more information see this link. You can loop over the string using isalpha and isdigit to check whether its characters are digits and alphabetic characters or not.
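// needs <ctype.h> and <stdio.h>
char str[] = "Santa-monica 123";
for (int i = 0; str[i] != '\0'; i++)
{
    unsigned char c = (unsigned char)str[i]; // the <ctype.h> functions expect an unsigned char value
    if (!isalpha(c) && !isdigit(c) && c != '-')
        printf("wrong character %c\n", str[i]); // this will be printed for spaces too
}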
I want to ... check if the string has only letters, numbers and dashes in it.
In C that's traditionally done with isalnum(3) and friends.
bool valid( const char str[] ) {
for( const char *p = str; *p != '\0'; p++ ) {
if( ! (isalnum((unsigned char)*p) || *p == '-') )
return false;
}
return true;
}
You can also use your friendly neighborhood regex(3), but you'll find that requires a surprising amount of code for a simple scan.
After retrieving the value with sscanf(), you may use a regular expression to validate it.
Please see Regular Expression in C
Pre-History:
I had the issue that a getchar() call was not processed properly: the program did not wait for any input and simply carried on.
I searched the internet for the cause and found that if scanf() is called before getchar(), then getchar() may not behave as expected, which matched my problem.
Citation:
I will bet you ONE HUNDRED DOLLARS you only see this problem when the call to getchar() is preceded by a scanf().
Don't use scanf for interactive programs. There are two main reasons for this:
1) scanf can't recover from malformed input. You have to get the format string right, every time, or else it just throws away whatever input it couldn't match and returns a value indicating failure. This might be fine if you're parsing a fixed-format file when poor formatting is unrecoverable anyway, but it's the exact opposite of what you want to do with user input. Use fgets() and sscanf(), fgets() and strtok(), or write your own user input routines using getchar() and putchar().
1.5) Even properly used, scanf inevitably discards input (whitespace) that can sometimes be important.
2) scanf has a nasty habit of leaving newlines in the input stream. This is fine if you never use anything but scanf, since scanf will usually skip over any whitespace characters in its eagerness to find whatever it's expecting next. But if you mix scanf with fgets/getchar, it quickly becomes a total mess trying to figure out what might or might not be left hanging out in the input stream. Especially if you do any looping -- it's quite common for the input stream to be different on the first iteration, which results in a potentially weird bug and even weirder attempts to fix it.
tl;dr -- scanf is for formatted input. User input is not formatted. //
Here is the link, to that thread: https://bbs.archlinux.org/viewtopic.php?id=161294
scanf() with:
scanf("%x", &integer_variable);
seems for me as a newbie to the scene as the only way possible to input a hex number from the keyboard (or, better said, from stdin) and store it in an int variable.
Is there a different way to read a hex value from stdin and store it in an integer variable?
Bonus challenge: it would also be nice if I could write negative values (through negative hex input, of course) into a signed int variable.
INFO: I have read many threads about C here on Stack Overflow about similar problems, but none of them answers my explicit question well, so I've posted this question.
I work under Linux Ubuntu.
The quote about the hundred dollar bet is accurate. Mixing scanf with getchar is almost always a bad idea; it almost always leads to trouble. It's not that they can't be used together, though. It's possible to use them together -- but usually, it's just way too difficult. There are too many fussy little details and "gotcha!"s to keep track of. It's more trouble than it's worth.
At first you had said
scanf() with ... %d ... seems for me as a newbie to the scene as the only way possible to input a hex number from the keyboard
There was some side confusion there, because of course %d is for decimal input. But since I'd written this answer by the time you corrected that, let's proceed with decimal for the moment.
(Also for the moment I'm leaving out error checking -- that is, these code fragments don't check for or do anything graceful if the user doesn't type the requested number.) Anyway, here are several ways of reading an integer:
scanf("%d", &integer_variable);
You're right, this is the (superficially) easiest way.
char buf[100];
fgets(buf, sizeof(buf), stdin);
integer_variable = atoi(buf);
This is, I think, the easiest way that doesn't use scanf. But most people these days frown on using atoi, because it doesn't do much useful error checking.
char buf[100];
fgets(buf, sizeof(buf), stdin);
integer_variable = strtol(buf, NULL, 10);
This is almost the same as before, but avoids atoi in favor of the preferred strtol.
char buf[100];
fgets(buf, sizeof(buf), stdin);
sscanf(buf, "%d", &integer_variable);
This reads a line and then uses sscanf to parse it, another popular and general technique.
All of these will work; all of these will handle negative numbers. It's important to think about error conditions, though -- I'll have more to say about that later.
If you want to input hexadecimal numbers, the techniques are similar:
scanf("%x", &integer_variable);
char buf[100];
fgets(buf, sizeof(buf), stdin);
integer_variable = strtol(buf, NULL, 16);
char buf[100];
fgets(buf, sizeof(buf), stdin);
sscanf(buf, "%x", &integer_variable);
These should all work, too. I wouldn't necessarily expect them to handle "negative hexadecimal", though, because that's an unusual requirement. Most of the time, hexadecimal notation is used for unsigned integers. (In fact, strictly speaking, %x with scanf and sscanf must be used with an integer_variable that has been declared as unsigned int, not plain int.)
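On the "negative hexadecimal" request: strtol is specified to accept an optional leading sign in any base, so the fgets-plus-strtol route can cover it. The following is only a sketch; parse_signed_hex is a made-up name, and the decision to reject trailing characters is an assumption about what the caller wants.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse a line such as "-1A\n" as signed hexadecimal.
   Returns true on success and stores the value in *out. */
bool parse_signed_hex(const char *line, long *out) {
    char *end;
    errno = 0;
    long v = strtol(line, &end, 16);   /* accepts an optional sign and "0x" prefix */
    if (end == line) return false;     /* no hex digits at all */
    if (errno == ERANGE) return false; /* value does not fit in a long */
    if (*end != '\n' && *end != '\0') return false; /* trailing garbage */
    *out = v;
    return true;
}
```

The line would typically come from fgets, as in the fragments above; the result is a long, so a caller storing it in an int still needs a range check.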
Sometimes it's useful or necessary to do this sort of thing "by hand". Here's a code fragment that reads exactly two hexadecimal digits. I'll start out with the version using getchar:
int c1 = getchar();
if(c1 != EOF && isascii(c1) && isxdigit(c1)) {
int c2 = getchar();
if(c2 != EOF && isascii(c2) && isxdigit(c2)) {
if(isdigit(c1)) integer_variable = c1 - '0';
else if(isupper(c1)) integer_variable = 10 + c1 - 'A';
else if(islower(c1)) integer_variable = 10 + c1 - 'a';
integer_variable = integer_variable * 16;
if(isdigit(c2)) integer_variable += c2 - '0';
else if(isupper(c2)) integer_variable += 10 + c2 - 'A';
else if(islower(c2)) integer_variable += 10 + c2 - 'a';
}
}
As you can see, it's a bit of a jawbreaker. Me, although I almost never use members of the scanf family, this is one place where I sometimes do, precisely because doing it "by hand" is so much work. You can simplify it considerably by using an auxiliary function or macro to do the digit conversion:
int c1 = getchar();
if(c1 != EOF && isascii(c1) && isxdigit(c1)) {
int c2 = getchar();
if(c2 != EOF && isascii(c2) && isxdigit(c2)) {
integer_variable = Xctod(c1);
integer_variable = integer_variable * 16;
integer_variable += Xctod(c2);
}
}
Or you could collapse those inner expressions down to just
integer_variable = 16 * Xctod(c1) + Xctod(c2);
These work in terms of an auxiliary function:
int Xctod(int c)
{
if(!isascii(c)) return 0;
else if(isdigit(c)) return c - '0';
else if(isupper(c)) return 10 + c - 'A';
else if(islower(c)) return 10 + c - 'a';
else return 0;
}
Or perhaps a macro (though this is definitely an old-school sort of thing):
#define Xctod(c) (isdigit(c) ? (c) - '0' : (c) - (isupper(c) ? 'A' : 'a') + 10)
Often I'm parsing hexadecimal digits like this not from stdin using getchar(), but from a string. Often I'm using a character pointer (char *p) to step through the string, meaning that I end up with code more like this:
char c1 = *p++;
if(isascii(c1) && isxdigit(c1)) {
char c2 = *p++;
if(isascii(c2) && isxdigit(c2))
integer_variable = 16 * Xctod(c1) + Xctod(c2);
}
It's tempting to omit the temporary variables and the error checking and boil this down still further:
integer_variable = 16 * Xctod(*p++) + Xctod(*p++);
But don't do this! Besides the lack of error checking, this expression is probably undefined, and it definitely won't always do what you want, because there's no longer any guarantee about what order you read the characters in. If you know p points at the first of two hex digits, you don't want to collapse it any further than
integer_variable = Xctod(*p++);
integer_variable = 16 * integer_variable + Xctod(*p++);
and even then, this will work only with the function version of Xctod, not the macro, since the macro evaluates its argument multiple times.
Finally, let's talk about error handling. There are quite a few possibilities to worry about:
1. The user hits Return without typing anything.
2. The user types whitespace before or after the number.
3. The user types extra garbage after the number.
4. The user types non-numeric input instead of a number.
5. The code hits end-of-file; there are no characters to read at all.
And then how you handle these depends on what input techniques you're using. Here are the basic rules:
A. If you're calling scanf, fscanf, or sscanf, always check the return value. If it's not 1 (or, in the case where you had multiple % specifiers, it's not the number of values you expected to read), it means something went wrong. This will generally catch problems 4 and 5, and will handle case 2 gracefully. But it will often quietly ignore problems 1 and 3. (In particular, scanf and fscanf treat an extra \n just like leading whitespace.)
B. If you're calling fgets, again, always check the return value. You'll get NULL on EOF (problem 5). Handling the other problems depends on what you do with the line you read.
C. If you're calling atoi, it will deal gracefully with problem 2, but it will ignore problem 3, and it will quietly turn problem 4 into the number 0 (which is why atoi is usually not recommended any more).
D. If you're calling strtol or any of the other "strto" functions, they will deal gracefully with problem 2, and if you let them give you back an "end pointer", you can check for and deal with problems 3 and 4. (Note that I left the end-pointer handling out of my two strtol examples above.)
E. Finally, if you're doing something down-and-dirty like my "hardway" two-digit hex converter, you generally have to take care of all these problems, explicitly, yourself. If you want to skip leading whitespace you have to do so (the isspace function from <ctype.h> can help), and if there might be unexpected non-digit characters, you have to check for those, too. (That's what the calls to isascii and isxdigit are doing in my "hardway" two-digit hex converter.)
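To make rule D concrete, here is a sketch of a line classifier built on strtol and its end pointer. The enum and function names are invented for illustration. Problem 5 (end-of-file) is not handled here; it is assumed to be caught earlier by checking fgets for NULL, as in rule B.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical result codes for the problems discussed above. */
enum parse_status { PARSE_OK, PARSE_EMPTY, PARSE_NOT_NUMBER, PARSE_TRAILING, PARSE_RANGE };

/* Classify one line as returned by fgets(). Leading and trailing
   whitespace (problem 2) is accepted silently. */
enum parse_status parse_decimal(const char *line, long *out) {
    const char *p = line;
    while (isspace((unsigned char)*p)) p++;
    if (*p == '\0') return PARSE_EMPTY;       /* problem 1: nothing typed */

    char *end;
    errno = 0;
    long v = strtol(p, &end, 10);
    if (end == p) return PARSE_NOT_NUMBER;    /* problem 4: no digits found */
    if (errno == ERANGE) return PARSE_RANGE;  /* value too large for a long */

    while (isspace((unsigned char)*end)) end++;
    if (*end != '\0') return PARSE_TRAILING;  /* problem 3: garbage after number */

    *out = v;
    return PARSE_OK;
}
```

A caller can then decide per status whether to re-prompt, report an error, or give up, which is the part scanf makes awkward.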
Per the scanf man page, you can use scanf to read a hex number from stdin into an (unsigned) integer variable.
unsigned int v ;
if ( scanf("%x", &v) == 1 ) {
// do something with v.
}
As per the man page, %x is always unsigned. If you want to support negative values, you will have to add explicit logic.
As mentioned in the link you posted, using fgets and sscanf is the best way to handle this. fgets will read a full line of text and sscanf will parse the line.
For example
char line[100];
fgets(line, sizeof(line), stdin);
unsigned int x; // %x expects a pointer to unsigned int
int rval = sscanf(line, "%x", &x);
if (rval == 1) {
printf("read value %x\n", x);
} else {
printf("please enter a hexadecimal integer\n");
}
Since you're only reading in a single integer, you could also use strtol instead of sscanf. This also has the advantage of detecting if any additional characters were entered:
char *ptr;
errno = 0;
long x = strtol(line, &ptr, 16);
if (errno) {
perror("parsing failed");
} else if (*ptr != '\n' && *ptr != 0) {
printf("extra characters entered: %s\n", ptr);
} else {
printf("read value %lx\n", x);
}
I'm new to C, taking a university course.
In one of the tasks I'm given, I deal with strings. I take strings either entered by user or parsed from a file and then use a function on them to produce an answer (if a specific quality exists).
The string can be of variable length but it is acceptable to assume that their maximum length is 80 characters.
I created the program using a
char s[81];
and then filling up the same array with the different strings each time.
Since the string has to be null-terminated I just added a '\0' at index 80;
s[80] = '\0';
But then I got all kind of weird behaviors - Unrelated characters at the end of the string I entered. I assumed this is because there was space between the end of the 'real' characters and the '\0' character filled with garbage(?).
So what I did is I created a function:
void clean_string(char s[], int string_size) {
int index = 0;
while(index < string_size) {
s[index++] = '\0';
}
}
What I call clean is just filling a string up with zero characters. I do this every time I am done dealing with a string and ready to accept a new one. Then I fill up the string again character by character, and wherever I stop, the following character will be a '\0' for sure.
To not include any magic numbers in code (81 each time I call clean_string) I used the following:
#define STRING_LENGTH 81
That works for me. The strings show no strange behavior. But I wondered if this is considered bad practice. Are there problems with this approach?
Just emphasizing, I'm not asking for help in the assignment itself, but tips on how to approach these kind of situations better.
Rather than prefilling the entire array with zeros, it should be simple to just add a single zero after you've read all relevant characters.
For example:
char s[STRING_LENGTH];
int c;
int idx = 0;
while (((c = getchar()) != '\n') && (idx < STRING_LENGTH - 1) && (c != EOF)) {
s[idx++] = c;
}
s[idx] = 0;
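If whole lines are being read, fgets is another option, because it writes the terminating '\0' itself. This is a sketch under that assumption; read_line is a hypothetical name, and STRING_LENGTH is redefined here only to keep the fragment self-contained.

```c
#include <stdio.h>
#include <string.h>

#define STRING_LENGTH 81   /* 80 characters plus the terminating '\0' */

/* Hypothetical helper: read one line into s (capacity STRING_LENGTH).
   fgets() always null-terminates what it stores, so the buffer never needs
   to be cleared beforehand. Returns 1 on success, 0 on EOF or error. */
int read_line(FILE *in, char s[STRING_LENGTH]) {
    if (fgets(s, STRING_LENGTH, in) == NULL)
        return 0;
    s[strcspn(s, "\n")] = '\0';   /* drop the trailing newline, if present */
    return 1;
}
```

The same function works for stdin and for a file opened with fopen, which matches the two input sources mentioned in the question.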
I've got a UTF-8 text file containing several signs that I'd like to replace with other ones (only those between |( and |) ), but the problem is that some of these signs are not treated as single characters but as multi-character signs. (By this I mean they can't be written as a character constant like '∞' but only as a string like "∞", so char * ?)
Here is my textfile :
Text : |(abc∞∪v=|)
For example :
∞ should be changed to ¤c
∪ to ¸!
= to "
Since some signs (∞ and ∪) are multi-character, I decided to use fscanf to read the text word by word. The problem with this method is that I have to put a space between each character, so my file would have to look like this:
Text : |( a b c ∞ ∪ v = |)
fgetc can't be used because characters like ∞ can't be treated as one single character. If I use it, I won't be able to strcmp a char with each sign (char *); I tried converting my char to char *, but strcmp still returned non-zero.
Here is my code in C to help you understand my problem:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main(void){
char *carac[]={"∞","=","∪"}; //array with our signs
FILE *flot,*flot3;
flot=fopen("fichierdeTest2.txt","r"); // input text file
flot3=fopen("resultat.txt","w"); //output file
int i=0,j=0;
char a[1024]; //array that will contain each read word.
while(!feof(flot))
{
fscanf(flot,"%s",&a[i]);
if (strstr(&a[i], "|(") != NULL){ // if the word read contains |( then j=1
j=1;
fprintf(flot3,"|(");
}
if (strcmp(&a[i], "|)") == 0)
j=0;
if(j==1) { //it means we are between |( and |) so the conversion can begin
if (strcmp(carac[0], &a[i]) == 0) { fprintf(flot3, "¤c"); }
else if (strcmp(carac[1], &a[i]) == 0) { fprintf(flot3,"\"" ); }
else if (strcmp(carac[2], &a[i]) == 0) { fprintf(flot3, " ¸!"); }
else fprintf(flot3,"%s",&a[i]); // when it's a letter, number or sign that doesn't need to be converted
}
else { // when we are not between |( and |) just copy the word to the output file with a space after it
fprintf(flot3, "%s", &a[i]);
fprintf(flot3, " ");
}
i++;
}
}
Thanks a lot in advance for the help!
EDIT: Every sign is changed correctly if I put a space between each of them, but without the spaces it doesn't work; that's what I'm trying to solve.
First of all, get the terminology right. Proper terminology is a bit confusing, but at least other people will understand what you are talking about.
In C, char is the same as byte. However, a character is something abstract like ∞ or ¤ or c. One character may contain a few bytes (that is a few chars). Such characters are called multi-byte ones.
Converting a character to a sequence of bytes (encoding) is not trivial. Different systems do it differently; some use UTF-8, while others may use UTF-16 big-endian, UTF-16 little-endian, an 8-bit codepage or any other encoding.
When your C program has something inside quotes, like "∞" - it's a C-string, that is, several bytes terminated by a zero byte. When your code uses strcmp to compare strings, it compares each byte of both strings, to make sure they are equal. So, if your source code and your input file use different encodings, the strings (byte sequences) won't match, even though you will see the same character when examining them!
So, to rule out any encoding mismatches, you might want to use a sequence of bytes instead of a character in your source code. For example, if you know that your input file uses the UTF-8 encoding:
char *carac[]={
"\xe2\x88\x9e", // ∞
"=",
"\xe2\x88\xaa"}; // ∪
Alternatively, make sure the encodings (of your source code and your program's input file) are the same.
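If the input is known to be UTF-8, the number of bytes in a character can be read off its first byte, which is one way to step through text a character at a time. The helper below is a sketch with an invented name; it does not validate continuation bytes or reject overlong encodings.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with lead byte c, or 0 if c cannot start a sequence. */
size_t utf8_seq_len(unsigned char c) {
    if (c < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((c & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;  /* 1110xxxx, e.g. the 0xE2 that starts ∞ */
    if ((c & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                          /* continuation byte or invalid lead */
}
```

For the ∞ sign this gives 3, matching the three bytes "\xe2\x88\x9e" listed above.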
Another, less subtle, problem: when comparing strings, you actually have a big string and a small string, and you want to check whether the big string starts with the small string. Here strcmp does the wrong thing! You must use strncmp here instead:
if (strncmp(carac[0], &a[i], strlen(carac[0])) == 0)
{
fprintf(flot3, "\xC2\xA4""c"); // ¤c
}
Another problem (actually, a major bug): the fscanf function reads a word (text delimited by spaces) from the input file. If you only examine the first byte in this word, the other bytes will not be processed. To fix, make a loop over all bytes:
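fscanf(flot, "%s", a);
for (i = 0; a[i] != '\0'; )
{
    if (strncmp(&a[i], "|(", 2) == 0) // start pattern (strncmp returns 0 on a match)
    {
        now_replacing = 1;
        i += 2;
        continue;
    }
    if (strncmp(&a[i], "|)", 2) == 0) // end pattern
    {
        now_replacing = 0;
        i += 2;
        continue;
    }
    if (now_replacing && strncmp(&a[i], whatever, strlen(whatever)) == 0)
    {
        fprintf(...);
        i += strlen(whatever);
    }
    else
    {
        fputc(a[i], output); // no replacement applies: copy the byte unchanged
        i += 1; // processed just one char
    }
}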
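A fuller sketch of the same byte-by-byte idea, operating on a whole string rather than on words from fscanf. The function name convert, the table layout, and the choice to copy the "|(" and "|)" markers through unchanged are all assumptions made for illustration; the input is assumed to be UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Replacement table from the question, written as UTF-8 byte sequences. */
static const struct { const char *from, *to; } table[] = {
    { "\xe2\x88\x9e", "\xc2\xa4" "c" },  /* ∞ -> ¤c */
    { "\xe2\x88\xaa", "\xc2\xb8" "!" },  /* ∪ -> ¸! */
    { "=",            "\""           },  /* = -> "  */
};

/* Hypothetical helper: copy in to out (capacity outsz), applying the table
   only between "|(" and "|)". Returns 0 on success, -1 if out is too small. */
int convert(const char *in, char *out, size_t outsz) {
    size_t o = 0;
    int replacing = 0;
    while (*in != '\0') {
        const char *src = in;   /* bytes to emit for this step */
        size_t srclen = 1;      /* default: copy one byte unchanged */
        size_t adv = 1;         /* input bytes consumed */

        if (strncmp(in, "|(", 2) == 0) {
            replacing = 1; srclen = adv = 2;
        } else if (strncmp(in, "|)", 2) == 0) {
            replacing = 0; srclen = adv = 2;
        } else if (replacing) {
            for (size_t k = 0; k < sizeof table / sizeof table[0]; k++) {
                size_t n = strlen(table[k].from);
                if (strncmp(in, table[k].from, n) == 0) {
                    src = table[k].to; srclen = strlen(src); adv = n;
                    break;
                }
            }
        }
        if (o + srclen >= outsz) return -1;   /* keep room for '\0' */
        memcpy(out + o, src, srclen);
        o += srclen;
        in += adv;
    }
    if (outsz == 0) return -1;
    out[o] = '\0';
    return 0;
}
```

Because the comparison is done at every byte offset, no spaces are needed between the signs in the input file.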
You're on the right track, but you need to look at characters differently than strings.
strcmp(carac[0], &a[i])
(Pretending i = 2) As you know, this compares the string "∞" with &a[2]. But you forget that &a[2] is the address of the third character of the string, and strcmp works by scanning the entire string until it hits a null terminator. So "∞" actually ends up getting compared with everything from that position to the end of the string, because a is only null-terminated at the very end.
What you should do is not use strings, but read each character as a short (16 bits) and then compare it with your UTF-16 characters. (This only applies if the data really is UTF-16; in a UTF-8 file, ∞ is stored as the three bytes E2 88 9E.)
if( 8734 == *((short *)&a[i])) { /* character is infinity */ }
The 8734 is the UTF-16 value of infinity.
VERY IMPORTANT NOTE:
Whether your machine is big-endian or little-endian matters in this case. If 8734 (0x221E) does not work, give 7714 (0x1E22) a try.
Edit: Something else I overlooked is that you're scanning the entire string at once. "%s: String of characters. This will read subsequent characters until a whitespace is found (whitespace characters are considered to be blank, newline and tab)." (source)
//feof = false.
fscanf(flot,"%s",&a[i]);
//feof = true.
That means you never actually iterate. You need to go back and rethink your scanning procedure.
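One common way to restructure such a loop is to test the return value of fscanf rather than feof. The sketch below only counts words, to keep the pattern visible; count_words is a made-up name and the 1024-byte buffer mirrors the one in the question.

```c
#include <stdio.h>

/* Hypothetical helper: count whitespace-delimited words in a stream.
   The loop is driven by fscanf's return value instead of feof(). */
int count_words(FILE *in) {
    char word[1024];
    int count = 0;
    /* %1023s limits each read to the buffer size (1023 chars + '\0').
       fscanf returns 1 while a word was read and EOF at end of input. */
    while (fscanf(in, "%1023s", word) == 1)
        count++;
    return count;
}
```

Each iteration sees exactly one word, so per-word processing can go where count++ is.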
What is the simplest way to detect a substring in a specific format?
For example, consider the string in C
"[random characters/symbols] a-b-c [random characters/symbols]"
Is there a function in C that allows me to detect the substring in the format "%s-%s-%s"?
Try starting at various points within the string until success.
"%*[^- ]" looks for a substring that contains neither a '-' nor a space.
"%n" records the offset in the scan.
#include<stdio.h>
int main(void) {
char *s = "[random characters/symbols] a-b-c [random characters/symbols]";
while (*s) {
int n = 0;
sscanf(s, "%*[^- ]-%*[^- ]-%*[^- ]%n", &n);
if (n) {
printf("Success '%.*s'\n", n, s);
break;
}
s++;
}
return 0;
}
Output
Success 'a-b-c'
Use strstr() (or strnstr(), if you have it) to detect a literal string (no pattern matching); strchr() only searches for a single character. strnstr() is better because you can specify a maximum length to protect against a string with a missing null terminator, but it is not ANSI C, so not all implementations have it. If you use strstr(), make sure you protect against a missing null terminator.
You can use regcomp() and regexec() to do a regular-expression search of the string.
See regex in C language using functions regcomp and regexec toggles between first and second match
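A sketch of the regcomp/regexec route for the "x-y-z" case, assuming a POSIX system. The function name find_dashed and the pattern are illustrative choices; each part is taken to be a run of characters other than '-' and space.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: find the first "x-y-z" style substring in s, where
   each part is a run of characters other than '-' and space. On success,
   stores the offset and length of the match and returns 1; otherwise 0. */
int find_dashed(const char *s, size_t *start, size_t *len) {
    regex_t re;
    regmatch_t m[1];
    /* In a bracket expression, '-' is literal when it comes last. */
    if (regcomp(&re, "[^ -]+-[^ -]+-[^ -]+", REG_EXTENDED) != 0)
        return 0;
    int found = (regexec(&re, s, 1, m, 0) == 0);
    if (found) {
        *start = (size_t)m[0].rm_so;
        *len = (size_t)(m[0].rm_eo - m[0].rm_so);
    }
    regfree(&re);
    return found;
}
```

Compiling the pattern on every call keeps the sketch simple; code that searches many strings would compile it once and reuse it.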
I'm making a calc function which is meant to check if the input is valid. So I'll have 2 strings: one with what the user inputs (e.g. 3+2-1, or maybe dog, which will be invalid), and one with the ALLOWED characters stored in a string, e.g. '123456789/*-+.^'.
I'm not sure how I can do this and have trouble getting started. I know a few functions such as strcmp and the popular ones from string.h, but I have no idea how to use them to check every character of the input.
What is the simplest way to do this?
One way of proceeding is the following.
A string is an array of ascii codes. So if your string is
char formula[50];
then you have a loop
int n = 0;
while (formula[n] != 0)
{
    if ( (formula[n]<........<<your code here>> ))
    {
        printf("invalid entry\n\n");
        return -1; // -1 = error code
    }
    n++;
}
You need to put the logic into the loop, but you can test the ASCII code of each character with this loop.
There may be a more elegant way of solving this, but this will work if you put the correct conditional statement here to check the ASCII code of each character.
The while statement checks to see if you got to the end of the string.
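Here's a demonstration of how to use strspn() to check that all characters in a string are in your chosen set. (strpbrk() would only tell you whether at least one character is in the set.)
#include <string.h>
#include <stdio.h>
const char alphabet[] = "123456789/*+-=.^";
int main(void) {
const char a[] = "3+2-1";
const char b[] = "dog";
/* strspn() returns the length of the leading run of characters taken from
   alphabet; the string is valid if that run reaches the terminating null. */
int ok = (a[strspn(a, alphabet)] == '\0');
printf("%s %s\n", a, ok ? "true" : "false");
ok = (b[strspn(b, alphabet)] == '\0');
printf("%s %s\n", b, ok ? "true" : "false");
return 0;
}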
That's not the fastest way to do this, but it's very easy to use.
However, if you are writing a calculator function, you really want to parse the string at the same time. A typical strategy would be to have two types of entity - operators (+-/*^) and operands (numbers, so -0.1, .0002, 42, etc). You would extract these from the string as you parse it, and just fail if you hit an invalid character. (If you need to handle parentheses, you'll need a stack for the parsing.... and you'll likely need to work with a stack anyway to process and evaluate the expression overall.)
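The operator/operand split can be sketched as a small scanner that alternates between expecting a number and expecting an operator. This is only an illustration: count_tokens is an invented name, parentheses are not handled, and nothing is evaluated.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: check that expr is operand (operator operand)*,
   e.g. "3+2-1". Returns the number of tokens, or -1 on invalid input. */
int count_tokens(const char *expr) {
    const char *p = expr;
    int tokens = 0;
    int want_operand = 1;
    for (;;) {
        while (isspace((unsigned char)*p)) p++;
        if (*p == '\0') break;
        if (want_operand) {
            /* Require a digit after an optional sign and decimal point,
               because strtod on its own would also accept words such as
               "inf" and "nan". */
            const char *q = p;
            if (*q == '+' || *q == '-') q++;
            if (*q == '.') q++;
            if (!isdigit((unsigned char)*q)) return -1;
            char *end;
            (void)strtod(p, &end);   /* value ignored; only the end matters */
            p = end;
        } else {
            if (strchr("+-*/^", *p) == NULL) return -1;
            p++;
        }
        tokens++;
        want_operand = !want_operand;
    }
    /* A valid expression is non-empty and ends with an operand. */
    return (tokens > 0 && !want_operand) ? tokens : -1;
}
```

A real calculator would store each operand and operator as it is scanned (for instance on the stack mentioned above) instead of just counting them.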