Parse data from input string - c

I'm having this kind of input data.
<html>......
<!-- OK -->
I only want to extract the data before the comment sign <!--.
This is my code:
char *parse_data(char *input) {
char *parsed_data = malloc(strlen(input) * sizeof(char));
sscanf(input, "%s<!--%*s", parsed_data);
return parsed_data;
}
However, it doesn't return the expected result, and I can't figure out why.
Could anyone explain the proper way to extract this kind of data, and the behavior of sscanf()?
Thank you!

The "%s" format specifier does not treat "<!--" as a single delimiter, nor any of its individual characters as delimiters (which would not be the correct behaviour anyway). Only whitespace is considered a delimiter. Scan sets are available in sscanf(), but they take a collection of individual characters rather than a sequence of characters representing a single delimiter. This means that everything in input before the first whitespace character will be assigned to parsed_data.
You could use strstr() instead:
const char* comment_start = strstr(input, "<!--");
char* result = 0;
if (comment_start)
{
result = malloc(comment_start - input + 1);
memcpy(result, input, comment_start - input);
result[comment_start - input] = 0;
}
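Note that sizeof(char) is guaranteed to be 1, so it can be omitted from the malloc() argument calculation. Note also that malloc(strlen(input)) in the question leaves no room for the terminating null character; a buffer that holds a copy of input needs strlen(input) + 1 bytes.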
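For reference, the strstr() approach above can be wrapped into a complete function. This is only a minimal sketch: the name copy_before_comment is hypothetical, and it assumes the caller is happy to receive a null pointer both when the marker is missing and when the allocation fails.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of the text that
   precedes the first "<!--", or NULL if the marker is absent or the
   allocation fails. The caller is responsible for free()ing the result. */
char *copy_before_comment(const char *input)
{
    assert(input != NULL);

    const char *comment_start = strstr(input, "<!--");
    if (comment_start == NULL)
        return NULL;

    size_t len = (size_t)(comment_start - input);
    char *result = malloc(len + 1);      /* +1 for the terminator */
    if (result == NULL)
        return NULL;

    memcpy(result, input, len);
    result[len] = '\0';
    return result;
}
```

Called on the input from the question, this should return everything up to, but not including, the "<!--".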

Related

getting a token between two different delimiters

I need to get the last character from a string. Say the string looks like this:
blue;5
I was thinking I could use strlen and then subtract 1 to get the 5. I have tried a bunch of different ways, but none of them work. Below is roughly how I think it should look, but I know it's not working. Any suggestions? This is sort of my code/pseudocode. I know it doesn't work for a variety of reasons, but it's the flow I had in mind.
len = strlen(Input);
Position = Input[len - 1];
strcpy(value, Input[Position]);
len = strlen(Input); //ok
Here is where you go wrong: Position should hold an index, but this assigns it the character itself.
Position = Input[len - 1]; //incorrect
Do it as
Position = strlen(Input) - 1; //correct
strcpy(value, &Input[Position]); //ok
If you really want the last character then you can use strlen() but not like that, instead like this
char string[] = "blue;5";
int position = strlen(string) - 1;
char last = string[position];
printf("%c\n", last);
Note that last is not a string but a single character, whose value is the character code (ASCII on most systems) of the last character in string; you can print its representation using the "%c" printf() specifier.
@Iharob has already posted some code that lets you access the last character as a character. But if you want a string, you can do this because it is at the end, and so NUL-terminated:
const char * lastword = string + (strlen(string) - 1);
printf("%s", lastword);
Note the '%s' - lastword is a "string", not a character. It's just a string that is one letter long.
You are closer to the solution than you think:
len = strlen(Input);
strcpy(value, &Input[len - 1]); // copy last character
strcpy needs a pointer to the last char.
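One edge case none of the snippets above handle is the empty string: strlen() returns 0, and since its result is an unsigned size_t, subtracting 1 wraps around to a huge index. A hedged sketch of a guard follows; the helper name last_char_ptr is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the last character of s,
   or to the terminating NUL if s is empty. The result can be used as a
   one-character string ("%s") or dereferenced as a char ("%c"). */
const char *last_char_ptr(const char *s)
{
    assert(s != NULL);

    size_t len = strlen(s);
    return len > 0 ? s + len - 1 : s;
}
```

With "blue;5" the returned pointer refers to the '5', so it serves both as a char (by dereferencing it) and as the string "5".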

Removing a target character from a string using sscanf

I've recently been learning about different conversion specifiers, but I am struggling to use one of the more complex conversion specifiers. The one in question being the bracket specifier (%[set]).
My understanding, from what I've read, is that with %[set] any string matching the sequence of characters in set (the scanset) is consumed and assigned, and that %[^set] has the opposite effect: it consumes and assigns any string that does not contain the sequence of characters in the scanset.
That's my understanding, albeit roughly explained. I was trying to use this specifier with sscanf to remove a specified character from a string using sscanf:
sscanf(str_1, "%[^#]", str_2);
Suppose that str_1 contains "OH#989". My intention is to store this string in str_2, but removing the hash character in the process. However, sscanf stops reading at the hash character, storing only "OH" when I am intending to store "OH989".
Am I using the correct method in the wrong way, or am I using the wrong method altogether? How can I correctly remove/extract a specified character from a string using sscanf? I know this is possible to achieve with other functions and operators, but ideally I am hoping to use sscanf.
The scanset matches a sequence of (one or more) characters that either do or don't match the contents of the scanset brackets. It stops when it comes across the first character that isn't in the scanset. To get the two parts of your string, you'd need to use something like:
sscanf(str_1, "%[^#]#%[^#]", str_2, str_3);
We can negotiate on the second conversion specification; it might be that %s is sufficient, or some other scanset is appropriate. But this would give you the 'before #' and 'after #' strings that could then be concatenated to give the desired result string.
I guess, if you really want to use sscanf for the purpose of removing a single target character, you could do this:
char str_2[strlen(str_1) + 1];
if (sscanf(str_1, "%[^#]", str_2) == 1) {
size_t len = strlen(str_2);
/* must verify if a '#' was found at all */
if (str_1[len] != '\0') {
strcpy(str_2 + len, str_1 + len + 1);
}
} else {
/* '#' is the first character */
strcpy(str_2, str_1 + 1);
}
As you can see, sscanf is not the right tool for the job, because it has many quirks and shortcomings. A simple loop is more efficient and less error prone. You could also parse str_1 into 2 separate strings with sscanf(str_1, "%[^#]#%[\001-\377]", str_2, str_3); and deal with the 3 possible return values:
char str_2[strlen(str_1) + 1];
char str_3[strlen(str_1) + 1];
switch (sscanf(str_1, "%[^#]#%[\001-\377]", str_2, str_3)) {
case 0: /* empty string or initial '#' */
strcpy(str_2, str_1 + (str_1[0] == '#'));
break;
case 1: /* no '#' or no trailing part */
break;
case 2: /* general case */
strcat(str_2, str_3);
break;
}
/* str_2 holds the result */
Removing a target character from a string using sscanf
sscanf() is not the best tool for this task, see far below.
// Not elegant code
// Width limits omitted for brevity.
str_2[0] = '\0';
char *p = str_2;
// Test for the end of the string
while (*str_1) {
int n; // record stopping offset
int cnt = sscanf(str_1, "%[^#]%n", p, &n);
if (cnt == 0) { // first character is a #
str_1++; // advance to next
} else {
str_1 += n; // advance n characters
p += n;
}
}
Simple loop:
Remove the needles from a haystack and save the hay in a bail.
char needle = '#';
assert(needle);
do {
while (*haystack == needle) haystack++;
} while (*bail++ = *haystack++);
With the 2nd method, code could use haystack = bail = str_1
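A variant of the simple loop, written as a complete in-place function, might look like the sketch below. The name remove_char and the choice to return the new length are assumptions made for illustration; they are not part of the answers above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: removes every occurrence of needle from s in place
   and returns the length of the resulting string. */
size_t remove_char(char *s, char needle)
{
    assert(s != NULL);
    assert(needle != '\0');   /* removing the terminator makes no sense */

    char *out = s;            /* next position to write to */
    for (const char *in = s; *in != '\0'; in++) {
        if (*in != needle)
            *out++ = *in;     /* keep everything that is not the needle */
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

Because the write position never overtakes the read position, the source and destination can safely be the same buffer.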

Using sscanf to read strings

I am trying to save one character and 2 strings into variables.
I use sscanf to read strings with the following form :
N "OldName" "NewName"
What I want : char character = 'N' , char* old_name = "OldName" , char* new_name = "NewName" .
This is how I am trying to do it :
sscanf(mystring,"%c %s %s",&character,old_name,new_name);
printf("%c %s %s",character,old_name,new_name);
The problem is, my program stops working without any output.
(I want to ignore the quotation marks too and save only its content)
When you do
char* new_name = "NewName";
you make the pointer new_name point to the read-only string array containing the constant string literal. The array contains exactly 8 characters (the letters of the string plus the terminator).
First of all, using that pointer as a destination for scanf will cause scanf to write to the read-only array, which leads to undefined behavior. And if you give a string longer than 7 characters, then scanf will also attempt to write out of bounds, again leading to undefined behavior.
The simple solution is to use actual arrays, and not pointers, and to also tell scanf to not read more than can fit in the arrays. Like this:
char old_name[64]; // Space for 63 characters plus string terminator
char new_name[64];
sscanf(mystring,"%c %63s %63s",&character,old_name,new_name);
To skip the quotation marks you have a couple of choices: Either use pointers and pointer arithmetic to skip the leading quote, and then set the string terminator at the place of the last quote to "remove" it. Another solution is to move the string to overwrite the leading quote, and then do as the previous solution to remove the last quote.
Or you could rely on the limited pattern-matching capabilities of scanf (and family):
sscanf(mystring,"%c \"%63s\" \"%63s\"",&character,old_name,new_name);
Note that the above sscanf call will work iff the string actually includes the quotes.
Second note: As said in the comment by Cool Guy, the above won't actually work since scanf is greedy. It will read until the end of the file/string or a white-space, so it won't actually stop reading at the closing double quote. The only working solution using scanf and family is the one below.
Also note that scanf and family, when reading a string using "%s", stop reading at white-space, so if the string is "New Name" then it won't work either. If this is the case, then you either need to parse the string manually, or use the odd "%[" format, something like
sscanf(mystring,"%c \"%63[^\"]\" \"%63[^\"]\"",&character,old_name,new_name);
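To make the scanset call above easier to reuse, it can be wrapped in a function that also checks the result. This is a sketch under stated assumptions: parse_rename is a hypothetical name, both output buffers are assumed to hold at least 64 bytes, and a trailing %n is used to confirm that the final closing quote was actually matched (sscanf's return value alone does not tell you that).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parses a line of the form  N "old" "new".
   old_name and new_name must each have room for at least 64 bytes.
   Returns 1 on success, 0 if the line does not match the pattern. */
int parse_rename(const char *line, char *kind, char *old_name, char *new_name)
{
    assert(line && kind && old_name && new_name);

    int end = -1;   /* stays -1 unless the whole format was matched */
    int converted = sscanf(line, " %c \"%63[^\"]\" \"%63[^\"]\"%n",
                           kind, old_name, new_name, &end);
    return converted == 3 && end >= 0;
}
```

Note that a scanset must match at least one character, so an empty name ("") is rejected by this sketch.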
You must allocate space for your strings, e.g:
char* old_name = malloc(128);
char* new_name = malloc(128);
Or using arrays
char old_name[128] = {0};
char new_name[128] = {0};
In case of malloc you also have to free the space before the end of your program.
free(old_name);
free(new_name);
Updated:...
The other answers provide good methods of creating memory as well as how to read the example input into buffers. There are two additional items that may help:
1) You expressed that you want to ignore the quotation marks too.
2) Reading first & last names when separated with space. (example input is not)
As @Joachim points out, because scanf and family stop scanning on a space with the %s format specifier, a name that includes a space such as "firstname lastname" will not be read in completely. There are several ways to address this. Here are two:
Method 1: tokenizing your input.
Tokenizing a string breaks it into sections separated by delimiters. Your string input examples for instance are separated by at least 3 usable delimiters: space: " ", double quote: ", and newline: \n characters. fgets() and strtok() can be used to read in the desired content while at the same time strip off any undesired characters. If done correctly, this method can preserve the content (even spaces) while removing delimiters such as ". A very simple example of the concept below includes the following steps:
1) reading stdin into a line buffer with fgets(...)
2) parse the input using strtok(...).
Note: This is an illustrative, bare-bones implementation, sequentially coded to match your input examples (with spaces) and includes none of the error checking/handling that would normally be included.
int main(void)
{
char line[128];
char delim[] = {"\n\""};//parse using only newline and double quote
char *tok;
char letter;
char old_name[64]; // Space for 63 characters plus string terminator
char new_name[64];
fgets(line, 128, stdin);
tok = strtok(line, delim); //consume 1st " and get token 1
if(tok) letter = tok[0]; //assign letter
tok = strtok(NULL, delim); //consume 2nd " and get token 2
if(tok) strcpy(old_name, tok); //copy tok to old name
tok = strtok(NULL, delim); //consume 3rd " throw away token 3
tok = strtok(NULL, delim); //consume 4th " and get token 4
if(tok) strcpy(new_name, tok); //copy tok to new name
printf("%c %s %s\n", letter, old_name, new_name);
return 0;
}
Note: as written, this example (as do most strtok(...) implementations) requires very narrowly defined input. In this case the input must be no longer than 127 characters, consisting of a single character followed by space(s), then a double-quoted string followed by more space(s), then another double-quoted string, as defined by your example:
N "OldName" "NewName"
The following input will also work in the above example:
N "old name" "new name"
N "old name" "new name"
Note also about this example, some consider strtok() broken, while others suggest avoiding its use. I suggest using it sparingly, and only in single threaded applications.
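If strtok()'s hidden state and its modification of the input are a concern, one alternative is to tokenize with strspn() and strcspn(), both standard C. The following is only a sketch: next_token is a hypothetical helper, and it silently truncates tokens that do not fit the output buffer.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copies the next token of *cursor (delimited by any
   character in delims) into out and advances *cursor past it.
   Returns 1 if a token was found, 0 at the end of the input.
   The input string itself is never modified. */
int next_token(const char **cursor, const char *delims,
               char *out, size_t out_size)
{
    assert(cursor && *cursor && delims && out && out_size > 0);

    const char *start = *cursor + strspn(*cursor, delims); /* skip delimiters */
    size_t len = strcspn(start, delims);                   /* token length */
    if (len == 0)
        return 0;

    size_t copy = len < out_size - 1 ? len : out_size - 1; /* truncate to fit */
    memcpy(out, start, copy);
    out[copy] = '\0';
    *cursor = start + len;
    return 1;
}
```

With the delimiter set "\n\"" used in Method 1, successive calls on the example line yield the same four tokens as the strtok() version: the letter (with its trailing space), the old name, the separating space, and the new name.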
Method 2: walking the string.
A C string is just an array of char terminated with a NUL character. By selectively copying some characters into another string, while bypassing the ones you do not want (such as the "), you can effectively strip unwanted characters from your input. Here is an example function that will do this:
char * strip_ch(char *str, char ch)
{
char *from, *to;
char *dup = strdup(str);//make a copy of input
if(dup)
{
from = to = dup;//set working pointers equal to pointer to input
for (; *from != '\0'; from++)//walk through input string
{
*to = *from;//copy the current character to the destination
if (*to != ch) to++;//test - increment only if not char to strip
//otherwise, leave it so next char will replace
}
*to = '\0';//write the NUL terminator
strcpy(str, dup);
free(dup);
}
return str;
}
Example use case:
int main(void)
{
char line[128] = {"start"};
while(strstr(line, "quit") == NULL)
{
printf("Enter string (\"quit\" to leave) and hit <ENTER>:");
fgets(line, 128, stdin);
strip_ch(line, '"');//strips in place; passing line to sprintf() as both source and destination is undefined behavior
printf("%s", line);
}
return 0;
}

Tokenizing a phone number in C

I'm trying to tokenize a phone number and split it into two arrays. It starts out in a string in the form of "(515) 555-5555". I'm looking to tokenize the area code, the first 3 digits, and the last 4 digits. The area code I would store in one array, and the other 7 digits in another one. Both arrays are to hold just the numbers themselves.
My code seems to work... sort of. The issue is when I print the two storage arrays, I find some quirks;
My array aCode; it stores the first 3 digits as I ask it to, but then it also prints some garbage values notched at the end. I walked through it in the debugger, and the array only stores what I'm asking it to store- the 515. So how come it's printing those garbage values? What gives?
My array aNum; I can append the tokens I need to the end of it, the only problem is I end up with an extra space at the front (which makes sense; I'm adding on to an empty array, i.e. adding on to empty space). I modify the code to only hold 7 variables just to mess around, I step into the debugger, and it tells me that the array holds an empty space and 6 of the digits I need; there's no room for the last one. Yet when I print it, the space AND all 7 digits are printed. How does that happen?
And how could I set up my strtok function so that it first copies the 3 digits before the "-", then appends to that the last 4 I need? All examples of tokenization I've seen utilize a while loop, which would mean I'd have to choose either strcat or strcpy to complete my task. I can set up an "if" statement to check for the size of the current token each time, but that seems too crude to me and I feel like there's a simpler method to this. Thanks all!
int main() {
char phoneNum[]= "(515) 555-5555";
char aCode[3];
char aNum[7];
char *numPtr;
numPtr = strtok(phoneNum, " ");
strncpy(aCode, &numPtr[1], 3);
printf("%s\n", aCode);
numPtr = strtok(&phoneNum[6], "-");
while (numPtr != NULL) {
strcat(aNum, numPtr);
numPtr = strtok(NULL, "-");
}
printf("%s", aNum);
}
I can primarily see two errors,
Being an array of 3 chars, aCode is not null-terminated here. Using it as an argument to the %s format specifier in printf() invokes undefined behaviour. The same applies, in a different way, to aNum.
strcat() expects a null-terminated array for both arguments. aNum is not null-terminated when used for the first time, which results in UB, too. Always initialize your local variables.
Also, see the other answers for complete, bug-free code.
The biggest problem in your code is undefined behavior: since you are reading a three-character constant into a three-character array, you have left no space for null terminator.
Since you are tokenizing a value in a very specific format of fixed length, you could get away with a very concise implementation that employs sscanf:
char *phoneNum = "(515) 555-5555";
char aCode[3+1];
char aNum[7+1];
sscanf(phoneNum, "(%3[0-9]) %3[0-9]-%4[0-9]", aCode, aNum, &aNum[3]);
printf("%s %s", aCode, aNum);
This solution passes the format (###) ###-#### directly to sscanf, and tells the function where each value needs to be placed. The only "trick" used above is passing &aNum[3] for the last argument, instructing sscanf to place data for the third segment into the same storage as the second segment, but starting at position 3.
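The sscanf call above does not verify that the input really had the expected shape. A hedged sketch of a checked version follows; split_phone is a hypothetical name, and the rule that the area code must be exactly 3 digits and the number exactly 7 is an assumption taken from the (###) ###-#### format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits "(515) 555-5555" into a 3-digit area code
   and a 7-digit number. area needs 4 bytes, number needs 8.
   Returns 1 if all three groups were found with the expected widths. */
int split_phone(const char *phone, char *area, char *number)
{
    assert(phone && area && number);

    /* The third group is written directly after the second one. */
    int converted = sscanf(phone, "(%3[0-9]) %3[0-9]-%4[0-9]",
                           area, number, number + 3);

    /* Check the count first: the buffers are only valid if all three
       conversions succeeded. */
    return converted == 3 && strlen(area) == 3 && strlen(number) == 7;
}
```

Checking the lengths catches inputs such as "(51) 555-1234", where every conversion succeeds but a group is too short.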
Your code has multiple issues
You allocate the wrong size for aCode; you should add 1 for the nul terminator byte and initialize the whole array to '\0' to ensure the string is always terminated.
char aCode[4] = {'\0'};
You don't check if strtok() returns NULL.
numPtr = strtok(phoneNum, " ");
strncpy(aCode, &numPtr[1], 3);
Point 1 also applies to aNum in strcat(aNum, numPtr), which will additionally fail because aNum is not yet initialized at the first call.
Subsequent calls to strtok() must have NULL as the first parameter, hence
numPtr = strtok(&phoneNum[6], "-");
is wrong, it should be
numPtr = strtok(NULL, "-");
Other answers have already mentioned the major issue, which is insufficient space in aCode and aNum for the terminating NUL character. The sscanf answer is also the cleanest for solving the problem, but given the restriction of using strtok, here's one possible solution to consider:
char phone_number[]= "(515) 555-1234";
char area[3+1] = "";
char digits[7+1] = "";
const char *separators = " (-)";
char *p = strtok(phone_number, separators);
if (p) {
int len = 0;
(void) snprintf(area, sizeof(area), "%s", p);
while (len < sizeof(digits) && (p = strtok(NULL, separators))) {
len += snprintf(digits + len, sizeof(digits) - len, "%s", p);
}
}
(void) printf("(%s) %s\n", area, digits);

Using fgets() in file parsing

I have a file which contains several lines.
I am tokenizing the file, and if the token contains .word, I would like to store the rest of the line in a C string.
So if:
array: .word 0:10
I would like to store 0:10 in a c-string.
I am doing the following:
if (strstr(token, ".word")) {
char data_line[MAX_LINE_LENGTH + 1];
int word_ret = fgets(data_line, MAX_LINE_LENGTH, fptr);
printf(".word is %s\n", data_line);
}
The problem with this is that fgets() grabs the next line. How would I grab the remainder of the current line? Is that possible?
Thank you,
strstr() returns a pointer to where the first character of ".word" is found.
This means that if you add the length of ".word" (5 characters) to that, you will get a pointer to the characters after ".word", which is the string you want.
char *x = strstr(token, ".word");
char *string_wanted = x + 5;
First of all it is obvious that you need to use fgets only once for every line you parse and then work with a buffer where the line is stored.
Next having a whole line you have several choices: if the string format is fixed (something like " .word") then you may use the result of "strstr" function to locate the start of ".word", advance 6 characters (including space) from it and print the required word from the found position.
Another option is more complex but is in fact a little bit better: using the "strtok" function.
You need to have already read the input into a buffer, which I'm assuming is token, and from there you just copy from the return value of strstr + the length of ".word" to the end of the buffer. This is what I'd do:
char *location = strstr(token, ".word");
if (location != NULL) {
char data_line[MAX_LINE_LENGTH];
strncpy(data_line, location + 5, MAX_LINE_LENGTH - 1);
data_line[MAX_LINE_LENGTH - 1] = '\0'; /* strncpy does not terminate on truncation */
printf(".word is %s\n", data_line);
}
You could add 5 or 6 to the pointer location (depending on whether or not there's going to be a space after ".word") to get the rest of the line.
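Also note that the size parameter of fgets includes space for the terminating NUL character, whereas strncpy does not add a terminator when the source is at least as long as the given size, so the destination should be terminated explicitly.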
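Rather than choosing between adding 5 or 6, the whitespace after ".word" can be skipped explicitly. The sketch below assumes the whole line is already in a buffer; after_directive is a hypothetical name, and the returned pointer refers into the original buffer rather than to a copy.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns a pointer into line just past the first
   ".word" and any whitespace that follows it, or NULL if ".word" does
   not occur. No memory is allocated or copied. */
const char *after_directive(const char *line)
{
    assert(line != NULL);

    static const char directive[] = ".word";
    const char *p = strstr(line, directive);
    if (p == NULL)
        return NULL;

    p += sizeof directive - 1;             /* skip ".word" itself */
    while (isspace((unsigned char)*p))     /* skip the separating blanks */
        p++;
    return p;
}
```

If the line came from fgets(), the result still ends with the newline that fgets() keeps, so it may need trimming before use.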
