What's the easiest way to parse a string in C? - c

I have to parse this string in C:
XFR 3 NS 207.46.106.118:1863 0 207.46.104.20:1863\r\n
And be able to get the 207.46.106.118 part and 1863 part (the first ip address).
I know I could go char by char and eventually find my way through it, but what's the easiest way to get this information, given that the IP address in the string could change to a different format (with less digits)?

You can use sscanf() from the C standard lib. Here's an example of how to get the ip and port as strings, assuming the part in front of the address is constant:
#include <stdio.h>
int main(void)
{
const char *input = "XFR 3 NS 207.46.106.118:1863 0 207.46.104.20:1863\r\n";
const char *format = "XFR 3 NS %15[0-9.]:%5[0-9]";
char ip[16] = { 0 }; // ip4 addresses have max len 15
char port[6] = { 0 }; // port numbers are 16bit, ie 5 digits max
if(sscanf(input, format, ip, port) != 2)
puts("parsing failed");
else printf("ip = %s\nport = %s\n", ip, port);
return 0;
}
The important parts of the format strings are the scanset patterns %15[0-9.] and %5[0-9], which will match a string of at most 15 characters composed of digits or dots (ie ip addresses won't be checked for well-formedness) and a string of at most 5 digits respectively (which means invalid port numbers above 2^16 - 1 will slip through).
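If the well-formedness gaps mentioned above matter, the two captured strings can be range-checked afterwards. The following is only a sketch: `ip_is_well_formed` and `port_is_valid` are hypothetical helper names, and both assume their input already passed through the scansets, so it contains only digits (and dots for the address).

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if ip (digits and dots only, as produced
   by the %15[0-9.] scanset) is exactly four dot-separated values in 0..255. */
int ip_is_well_formed(const char *ip)
{
    unsigned a, b, c, d;
    int n = 0;

    /* %3u caps each field at three digits; %n records how much was read. */
    if (sscanf(ip, "%3u.%3u.%3u.%3u%n", &a, &b, &c, &d, &n) != 4)
        return 0;
    return ip[n] == '\0' && a <= 255 && b <= 255 && c <= 255 && d <= 255;
}

/* Hypothetical helper: returns 1 if port (digits only, as produced by the
   %5[0-9] scanset) is a value in 0..65535. */
int port_is_valid(const char *port)
{
    char *end;
    unsigned long value = strtoul(port, &end, 10);

    return end != port && *end == '\0' && value <= 65535UL;
}
```

Calling both on the `ip` and `port` buffers after a successful sscanf() would reject inputs such as 999.1.1.1 or port 70000.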

Depends on what defines the format of the document. In this case, it may be as simple as tokenizing the string and looking through the tokens for what you want. Simply use strtok and split on spaces to grab the 207.46.106.118:1863 and then you can tokenize that again (or simply scan for the : manually) to get the proper components.
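A rough sketch of that tokenizing approach follows. The function name `first_endpoint`, the 128-byte working buffer, and the assumption that the address is always the fourth space-separated field are illustrative choices, not something the question guarantees.

```c
#include <string.h>

/* Hypothetical helper: copies the IP and port of the fourth
   space-separated field ("ip:port") out of line.
   Returns 1 on success, 0 if the line does not have that shape. */
int first_endpoint(const char *line, char ip[16], char port[6])
{
    char copy[128];
    char *tok, *colon;
    int field;

    /* strtok() writes into its argument, so work on a private copy. */
    if (strlen(line) >= sizeof copy)
        return 0;
    strcpy(copy, line);

    /* Step over "XFR", "3" and "NS" to land on the fourth field. */
    tok = strtok(copy, " \r\n");
    for (field = 1; tok != NULL && field < 4; field++)
        tok = strtok(NULL, " \r\n");
    if (tok == NULL)
        return 0;

    /* Split "ip:port" in place at the colon. */
    colon = strchr(tok, ':');
    if (colon == NULL)
        return 0;
    *colon = '\0';

    /* Make sure both parts fit the caller's buffers before copying. */
    if (strlen(tok) > 15 || strlen(colon + 1) > 5)
        return 0;
    strcpy(ip, tok);
    strcpy(port, colon + 1);
    return 1;
}
```

With the string from the question this should leave 207.46.106.118 in `ip` and 1863 in `port`. Working on a copy keeps the caller's string intact.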

You could use strtok to tokenize breaking on space, or you could use one of the scanf family to pull out data as well.
There is a big caveat in all of this, though: these functions are notorious for security problems and for mishandling bad input. YMMV.

Loop through until you get the first '.', and loop back until you find ' '. Then loop forward until you find ':', building sub-strings every time you meet '.' or ':'. You can check the number of substrings and their lengths as simple error checking. Then loop until you find a ' ' and you have the 1863 part.
This would be robust if the beginning of the string doesn't vary much. And also very easy. You could make it even simpler if the string always begins with "XFR 3 NS ".
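The steps above might look roughly like this. It is a simplified sketch: `scan_endpoint` is a made-up name, strchr() and strcspn() stand in for the hand-written loops, and the per-substring checks on each '.' are left out.

```c
#include <string.h>

/* Hypothetical helper: finds the first "ip:port" in s without modifying s.
   Returns 1 and fills ip/port on success, 0 otherwise. */
int scan_endpoint(const char *s, char *ip, size_t ip_size,
                  char *port, size_t port_size)
{
    const char *dot, *start, *colon;
    size_t ip_len, port_len;

    /* Find the first '.', which is assumed to sit inside the address. */
    dot = strchr(s, '.');
    if (dot == NULL)
        return 0;

    /* Walk back to the character just after the preceding space. */
    start = dot;
    while (start > s && start[-1] != ' ')
        start--;

    /* The address ends at the next ':'. */
    colon = strchr(dot, ':');
    if (colon == NULL)
        return 0;
    ip_len = (size_t)(colon - start);

    /* The port runs up to the next space, CR, LF, or end of string. */
    port_len = strcspn(colon + 1, " \r\n");

    if (ip_len >= ip_size || port_len == 0 || port_len >= port_size)
        return 0;

    memcpy(ip, start, ip_len);
    ip[ip_len] = '\0';
    memcpy(port, colon + 1, port_len);
    port[port_len] = '\0';
    return 1;
}
```

Passing the buffer sizes explicitly lets the function refuse input that would not fit instead of overflowing.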

In this case, strtok() is of trivial use and would be my choice. For safety, you might count the ':' in your string and proceed if there is exactly one ':'.

If the strings to be parsed are well-formatted then I'd go with Daniel and Ukko's suggestion to use strtok().
A word of warning though: strtok() modifies the string that it parses. Not always what you want.

This may be overkill, since you said you didn't want to use a regex library, but the re2c program will give you regex parsing without the library: it generates the DFSM for a regular expression as C code. The regexps are specified in comments embedded in C code.
And what seems like overkill now may become a comfort to you later should you have to parse the rest of the string; it is a lot easier to modify a few regexps to adjust or add new syntax than to modify a bunch of ad hoc tokenizing code. And it makes the structure of what you are parsing a lot clearer in your code.

Related

How can I use sscanf to analyze string data?

How do I split a string into two strings (array name, index number) only if the string is matching the following string structure: "ArrayName[index]".
The array name can be 31 characters at most and the index 3 at most.
I found the following example which suppose to work with "Matrix[index1][index2]". I really couldn't understand how it does it in order to take apart the part I need to get my strings.
sscanf(inputString, "%32[^[]%*[[]%3[^]]%*[^[]%*[[]%3[^]]", matrixName, index1,index2) == 3
This try over here wasn't a success, what am I missing?
sscanf(inputString, "%32[^[]%*[[]%3[^]]", arrayName, index) == 2
How do I split a string into two strings (array name, index number) only if the string is matching the following string structure: "ArrayName[index]".
With sscanf, you don't. Not if you mean that you can rely on nothing being modified in the event that the input does not match the pattern. This is because sscanf, like the rest of the scanf family, processes its input and format linearly, without backtracking, and by design it fills input fields as they are successfully matched. Thus, if you scan with a format that assigns multiple fields or has trailing literal characters then it is possible for results to be stored for some fields despite a matching failure occurring.
But if that's ok with you then @gsamaras's answer provides a nearly-correct approach to parsing and validating a string according to your specified format, using sscanf. That answer also presents a nice explanation of the meaning of the format string. The problem with it is that it provides no way to distinguish between the input fully matching the format and the input failing to match at the final ], or including additional characters after.
Here is a variation on that code that accounts for those tail-end issues, too:
char array_name[32] = {0}, idx[4] = {0}, c = 0;
int n;
if (sscanf(str, "%31[^[][%3[^]]%c%n", array_name, idx, &c, &n) >= 3
&& c == ']' && str[n] == '\0')
printf("arrayName = %s\nindex = %s\n", array_name, idx);
else
printf("Not in the expected format \"ArrayName[idx]\"\n");
The difference in the format is the replacement of the literal terminating ] with a %c directive, which matches any one character, and the addition of a %n directive, which causes the number of characters of input read so far to be stored, without itself consuming any input.
With that, if the return value is at least 3 then we know that the whole format was matched (a %n never produces a matching failure, but docs are unclear and behavior is inconsistent on whether it contributes to the returned field count). In that event, we examine variable c to determine whether there was a closing ] where we expected to find one, and we use the character count recorded in n to verify that all characters of the string were parsed (so that str[n] refers to a string terminator).
You may at this point be wondering at how complicated and cryptic that all is. And you would be right to do so. Parsing structured input is a complicated and tricky proposition, for one thing, but also the scanf family functions are pretty difficult to use. You would be better off with a regex matcher for cases like yours, or maybe with a machine-generated lexical analyzer (see lex), possibly augmented by machine-generated parser (see yacc). Even a hand-written parser that works through the input string with string functions and character comparisons might be an improvement. It's still complicated any way around, but those tools can at least make it less cryptic.
Note: the above assumes that the index can be any string of up to three characters. If you meant that it must be numeric, perhaps specifically a decimal number, perhaps specifically non-negative, then the format can be adjusted to serve that purpose.
A naive example to get you started:
#include <stdio.h>
#include <string.h>
int main(void)
{
char str[] = "myArray[123]";
char array_name[32] = {0}, idx[4] = {0};
if(sscanf(str, "%31[^[][%3[^]]]", array_name, idx) == 2)
printf("arrayName = %s\nindex = %s\n", array_name, idx);
else
printf("Not in the expected format \"ArrayName[idx]\"\n");
return 0;
}
Output:
arrayName = myArray
index = 123
which will find easy not-in-the-expected format cases, such as "ArrayNameidx]" and "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP[idx]", but not "ArrayName[idx".
The essence of sscanf() is to tell it where to stop, otherwise %s would read until the next whitespace.
This negated scanset %[^[] means read until you find an opening bracket.
This negated scanset %[^]] means read until you find a closing bracket.
Note: I used 31 and 3 as the width specifiers respectively, which reserves the last slot of each buffer for the null terminator: the name of the array is assumed to be at most 31 characters, and the index at most 3. The size of the array for each token is the max allowed length, plus one.
How can I use sscanf to analyze string data?
Use "%n" to detect a completed scan.
array name can be 31 characters at most and the index 3 at most.
For illustration, let us assume the index needs to limit to a numeric value [0 - 999].
Use string literal concatenation to present the format more clearly.
char array_name[32]; // array name can be 31 characters
#define NAME_FMT "%31[^[]"
char idx[4]; // index can be 3 digits
#define IDX_FMT "%3[0-9]"
int n = 0; // be sure to initialize
sscanf(str, NAME_FMT "[" IDX_FMT "]" "%n", array_name, idx, &n);
// Did scan complete (is `n` non-zero) with no extra text?
if (n && str[n] == '\0') {
printf("arrayName = %s\nindex = %d\n", array_name, atoi(idx));
} else {
printf("Not in the expected format \"ArrayName[idx]\"\n");
}

removing multi-char constants in C

Here's some code I found in a very old C library that's trying to eat whitespace from a file...
while(
(line_buf[++line_idx] != ' ') &&
(line_buf[ line_idx] != ' ') &&
(line_buf[ line_idx] != ',') &&
(line_buf[ line_idx] != '\0') )
{
This great thread explains what the problem is, but most of the answers are "just ignore it" or "you should never do this". What I don't see, however, is the canonical solution. Can anyone offer a way to code this test using the "proper way"?
UPDATE: to clarify, the question is "what is the proper way to test for the presence of a string of one or more characters at a given index in another string". Forgive me if I am using the wrong terminology.
Original question
There is no canonical or correct way. Multi-character constants have always been implementation defined. Look up the documentation for the compiler used when the code was written and figure out what was meant.
Updated question
You can match multiple characters using strchr().
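// strchr() also finds the terminating '\0', so this stops at a space, a comma, or the end of the string
while (!strchr( " ,", line_buf[++line_idx] ))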
{
Again, this does not account for that multi-char constant. You should figure out why that was there before simply removing it.
Also, strchr() does not handle Unicode. If you are dealing with a UTF-8 stream, for example, you will need a function capable of handling it.
Finally, if you are concerned about speed, profile. The compiler might get you better results using the three (or four) individual test expressions in the ‘while’ condition.
In other words, the multiple tests might be the best solution!
Beyond that, I smell some uncouth indexing: the way that line_idx is updated depends on the surrounding code to actuate the loop properly. Make sure that you don’t create an off-by-one error when you update stuff.
Good luck!
UPDATE: to clarify, the question is "what is the proper way to test
for the presence of a string of one or more characters at a given
index in another string". Forgive me if I am using the wrong
terminology.
Well, there are a number of ways, but the standard way is using strspn which has the prototype:
size_t strspn(const char *s, const char *accept);
and it cleverly:
calculates the length (in bytes) of the initial segment of s
which consists entirely of bytes in accept.
This allows you to test for the "the presence of a string of one or more characters at a given index in another string" and tells you how many of the characters from that string were sequentially matched.
For example, if you had another string, say char *s = "somestring";, and wanted to know if it contained the letters r, s, t, say, in char *accept = "rst"; beginning at the 5th character, you could test:
size_t n;
if ((n = strspn (&s[4], accept)) > 0)
printf ("matched %zu chars from '%s' at beginning of '%s'\n",
n, accept, &s[4]);
To compare in order, you can use strncmp (&s[4], accept, strlen (accept)), which returns 0 when the characters match. You can also simply use nested loops to iterate over s with the characters in accept.
All of the ways are "proper", so long as they do not invoke Undefined Behavior (and are reasonably efficient).
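As a sketch of two of those options: strcspn() is the complement of strspn() and can replace a loop that advances to the next delimiter, while strncmp() gives the in-order comparison. Both function names below are hypothetical, and unlike the loop in the question, `next_delimiter` starts checking at the index it is given rather than pre-incrementing.

```c
#include <string.h>

/* Hypothetical helper: returns the index of the first space, comma, or
   terminating '\0' at or after start. Assumes start <= strlen(line_buf). */
size_t next_delimiter(const char *line_buf, size_t start)
{
    return start + strcspn(line_buf + start, " ,");
}

/* Hypothetical helper: returns 1 if word occurs in s exactly at index pos. */
int has_word_at(const char *s, size_t pos, const char *word)
{
    /* Refuse positions beyond the end of s before comparing. */
    if (pos > strlen(s))
        return 0;
    return strncmp(s + pos, word, strlen(word)) == 0;
}
```

For "somestring", `has_word_at(s, 4, "str")` is true, whereas `has_word_at(s, 4, "rst")` is false because the order matters.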

C Trying to match the exact substring and nothing more

I have tried different functions including strtok(), strcmp() and strstr(), but I guess I'm missing something. Is there a way to match the exact substring in a string?
For example:
If I have a name: "Tan"
And I have 2 file names: "SomethingTan5346" and "nothingTangyrs634"
So how can I make sure that I match the first string and not both? Because the second file is for the person Tangyrs. Or is it impossible with this approach? Am I going at it the wrong way?
If, as seems to be the case, you just want to identify strings that have your text but are immediately followed by a digit, your best bet is probably to get yourself a good regular expression implementation and just search for Tan[0-9].
It could be done simply by using strstr() to find the string then checking the character following that with isdigit() but the actual code to do that would be:
not as easy as you think since you may have to do multiple searches (e.g., TangoTangoTan42 would need three checks); and
inadvisable if there's the chance the searches may become more complex (such as Tan followed by 1-3 digits or exactly two # characters and an X).
A regular expression library will make this much easier, provided you're willing to invest a little effort into learning about it.
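As an illustration only, here is what the Tan[0-9] search could look like with the POSIX regex functions from <regex.h>. This assumes a POSIX system (it is not part of standard C), and `matches_pattern` is a made-up wrapper name.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text contains a match for pattern,
   0 if it does not, and -1 if the pattern fails to compile. */
int matches_pattern(const char *pattern, const char *text)
{
    regex_t re;
    int found;

    /* REG_NOSUB: only a yes/no answer is needed, not match positions. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

`matches_pattern("Tan[0-9]", "SomethingTan5346")` should report a match, while "nothingTangyrs634" should not. Changing the requirement later means editing only the pattern string.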
If you don't want to invest the time in learning regular expressions, the following complete test program should be a good starting point to evaluate a string based on the requirements in the first paragraph:
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int hasSubstrWithDigit(char *lookFor, char *searchString) {
// Cache length and set initial search position.
size_t lookLen = strlen(lookFor);
char *foundPos = searchString;
// Keep looking for string until none left.
while ((foundPos = strstr(foundPos, lookFor)) != NULL) {
// If at end, no possibility of following digit.
if (strlen(foundPos) == lookLen) return 0;
// If followed by digit, return true.
if (isdigit((unsigned char)foundPos[lookLen])) return 1;
// Otherwise keep looking, from next character.
foundPos++;
}
// Not found, return false.
return 0;
}
int main(int argc, char *argv[]) {
if (argc < 3) {
printf("Usage testprog <lookFor> <searchIn>...\n");
return 1;
}
for (int i = 2; i < argc; ++i) {
printf("Result of looking for '%s' in '%s' is %d\n", argv[1], argv[i], hasSubstrWithDigit(argv[1], argv[i]));
}
return 0;
}
Though, as you can see, it's not as elegant as a regex search, and is likely to become even less elegant if your requirements change :-)
Running that with:
./testprog Tan xyzzyTan xyzzyTan7 xyzzyTangy4 xyzzyTangyTan12
shows it in action:
Result of looking for 'Tan' in 'xyzzyTan' is 0
Result of looking for 'Tan' in 'xyzzyTan7' is 1
Result of looking for 'Tan' in 'xyzzyTangy4' is 0
Result of looking for 'Tan' in 'xyzzyTangyTan12' is 1
The solution depends on your definition of exact matching.
This might be useful for you:
Traverse all matches of the target substring.
C find all occurrences of substring
Finding all instances of a substring in a string
find the count of substring in string
https://cboard.cprogramming.com/c-programming/73365-how-use-strstr-find-all-occurrences-substring-string-not-only-first.html
etc.
Having the span of the match, verify that the previous and following characters match/do not match your criterion for "exact match".
Or,
You could take advantage of regex in C++ (I know the tag is "C"), with #include <regex>, or POSIX #include <regex.h>.
You may want to use strstr(3) to search a substring in a string, strchr(3) to search a character in a string, or even regular expressions with regcomp(3).
You should read more about parsing techniques, notably about recursive descent parsers. In some cases, sscanf(3) with %n can also be handy. You should take care of the return count.
You could loop to read then parse every line, perhaps using getline(3), see this.
You need first to document your input file format (or your file name conventions, if SomethingTan5346 is some file path), perhaps using EBNF notation.
(you probably want to combine several approaches I am suggesting above)
BTW, I recommend limiting (for your convenience) file paths to a restricted set of characters. For example using * or ; or spaces or tabs in file paths is possible (see path_resolution(7)) but should be frowned upon.

Identifying prefix in the same string as a suffix

Eg-
maabcma is valid because it contains ma as a proper prefix as well as a proper suffix.
panaba is not.
How do I find out if a word is valid or not as above in C language?
I'm not very good at string operations. So, please help me out with a pseudocode.
Thanks in advance.
I'm completely lost. T=number of test cases.
EDIT: New code. My best code so far-
#include<stdio.h>
#include<string.h>
void main()
{
int i,T,flag=0;
int j,k,len=0;
char W[10],X[10];
scanf("%d",&T);
for(i=0;i<T;i++)
{
scanf("%s",W);
for(len=0;W[len]!='\0';len++)
X[len]=W[len];
X[len]='\0';
for(j=len-1;j>=0;j--)
for(k=0;k<len;k++)
{
if(X[k]!=W[j])
flag=0;
else if((j-k)==(len-1))
flag==1;
}
if (flag == 1)
printf("NICE\n");
else
printf("NOT\n");
}
}
Still not getting the proper results. Where am I going wrong?
The thing is you are only setting the value of flag if a match exists, otherwise you must set it to 0. because see, if I have:
pammbap
my prefix is pam and suffix is bap.
According to the final for loop,
p and a match so flag is set to 1.
but when it comes to b and m it does not become zero. Hence, it returns true.
First, void is not a valid return type for main, unless you are developing for Plan 9.
Second, you should get into the habit of checking the return value of scanf() and all input functions in general. You can't rely on the value of T if the user does not input a number, because T is uninitialised. On that same note, you shouldn't use scanf with an unbounded %s scan operation. If the user enters 20 characters, this isn't going to fit into the ten character buffer that you have. An alternative approach is to use fgets to get a whole line of text at once, or, to use a bounded scan operation. If your array fits 10 characters (including the null terminator) then you can use scanf("%9s", W).
Third, single-character variable names are often very hard to understand. Instead of W, use word, instead of T, use testCount or something similar. This means that someone looking at your code for the first time can more easily work out what each variable is used for.
Most importantly, think about the process in your head, and maybe jot it down on paper. How would you solve this problem yourself? As an example, starting with n = 1,
Take the first n characters from the string.
Compare it to the last n characters from the string
Do they match?
If yes, print out the first n characters as the suffix and stop processing.
If no, increment n and try again. Try until n is in the middle of the string.
There are a few other things to think about as well, do you want the biggest match? For example, in the input string ababcdabab, the prefix ab is also the suffix, but the same can be said about abab. In this case, you don't want to stop processing, you want to keep going even if you find a prefix, so, you should just store the length of the largest prefix that is also the suffix.
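The process described above, including keeping the largest match, could be sketched as follows. `longest_prefix_suffix` is a hypothetical name, and the sketch follows the steps literally by only trying lengths up to the middle of the string, so prefix and suffix never overlap.

```c
#include <string.h>

/* Hypothetical helper: returns the length of the longest prefix of word
   that is also a suffix of it (trying lengths up to half the word),
   or 0 if there is none. */
size_t longest_prefix_suffix(const char *word)
{
    size_t len = strlen(word);
    size_t best = 0;
    size_t n;

    for (n = 1; n <= len / 2; n++) {
        /* Compare the first n characters with the last n characters. */
        if (memcmp(word, word + len - n, n) == 0)
            best = n;   /* keep going: a longer match may still exist */
    }
    return best;
}
```

A word is then valid when the result is non-zero: maabcma gives 2 (ma), panaba gives 0, and ababcdabab gives 4 (abab).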
Second-most-importantly, running into hurdles like this is incredibly common when learning C, so don't let this put a dampener on your enthusiasm, just keep trying!

Tool functions for chars

I want to handle some char variables and would like to get a list of some functions that can do these tasks when it comes to handling chars.
Getting first characters of a char (var_name[1] doesn't seem to work)
Getting last characters of a char
Checking whether char1 matches char2 (e.g. whether "unicorn" matches "bicycle")
I am pretty sure some of these methods exist in libraries such as stdio.h or so but Google isn't my friend.
EDIT: My 3rd question means not a direct match with strcmp but a single-character match (e.g. "hey" and "hello" have e as a common letter).
Use var_name[0] to get first character (array indexes run from 0 to N - 1, where N is the number of elements in the array).
Use var_name[strlen(var_name) - 1] to get the last character.
Use strcmp() to compare two char strings.
EDIT:
To search for character in a string you can use strchr():
if (strchr("hello", 'e') && strchr("hey", 'e'))
{
}
There is also strpbrk() function that would indicate if two strings have any common characters:
if (strpbrk("hello", "hey"))
{
}
Assuming you mean a char[], and not a char which is a single character.
C uses 0-based indexing, var_name[0] gives you the first char.
strlen() gives you the length of the string, which together with my answer to 1. means
char lastchar = var_name[strlen(var_name)-1]; http://www.cplusplus.com/reference/clibrary/cstring/strlen/
strcmp(var_name1, var_name2) == 0. http://www.cplusplus.com/reference/clibrary/cstring/strcmp/
I am pretty sure some of these methods exist in libraries such as
stdio.h or so but google isnt my friend.
The string functions in the C standard library (libc) are described in the header file <string.h>. If you're on a unix-ish machine, try typing man 3 string at a command line. You can then use the man program again to get more information about specific functions, e.g. man 3 strlen. (The '3' just tells man to look in "section 3", which describes the C standard library functions.)
What you're looking for is the string functions in the C runtime library. These are defined in string.h, not stdio.h.
But your list of problems is simple:
var_name[0] works perfectly well for accessing the first char in an array. var_name[1] gives the second char, not the first, because arrays in C are zero-based.
The last char in an array is:
char c;
c = var_name[strlen(var_name)-1];
Testing for equality is simple:
if (var_name[0] == var_name[1])
; // they match
C and C++ strings are zero indexed. The memory you need to hold a particular length string has to be at least the string length and one character for the string terminator \0. So, the first character is array[0].
As @Carey Gregory said, the basic string handling functions are in string.h. But these are only primitives for handling strings. C is a low-level enough language that you have an opportunity to build up your own string handling library based on the functions in string.h.
One example might be that you want to pass a string pointer to a function and also the length of the buffer holding that same string, not just the string length itself.
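A minimal sketch of such a function, built only on string.h primitives, is shown below. The name `copy_string` and its return convention are invented for illustration.

```c
#include <string.h>

/* Hypothetical helper: copies src into dst, a buffer of dst_size bytes.
   Returns 1 if src fit completely, 0 if it was truncated (or dst_size is 0).
   dst is always null-terminated when dst_size > 0. */
int copy_string(char *dst, size_t dst_size, const char *src)
{
    size_t len = strlen(src);

    if (dst_size == 0)
        return 0;
    if (len >= dst_size) {
        /* Not enough room: keep what fits and terminate. */
        memcpy(dst, src, dst_size - 1);
        dst[dst_size - 1] = '\0';
        return 0;
    }
    memcpy(dst, src, len + 1);   /* includes the terminator */
    return 1;
}
```

Called as `copy_string(buf, sizeof buf, "hello")` with a 4-byte `buf`, it would store "hel" and report truncation rather than write past the buffer.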

Resources