Checking for a blank line in C - Regex

Goal:
Find if a string contains a blank line, whether it be '\n\n', '\r\n\r\n', '\r\n\n', or '\n\r\n'.
Issues:
I don't think my current regex for finding '\n\n' is right. This is my first time really using regex outside of simple use of * when removing files on the command line.
Is it possible to check for all of these cases (listed above) in one regex, or do I have to do four separate calls to compile_regex?
Code:
int checkForBlankLine(char *reader) {
    regex_t r;
    compile_regex(&r, "*\n\n");
    match_regex(&r, reader);
    return 0;
}

void compile_regex(regex_t *r, char *matchText) {
    regcomp(r, matchText, 0);
}

int match_regex(regex_t *r, char *reader) {
    regmatch_t match[1];
    int nomatch = regexec(r, reader, 1, match, 0);
    if (nomatch) {
        printf("No matches.\n");
    } else {
        printf("MATCH!\n");
    }
    return 0;
}
Notes:
I only need to worry about finding one blank line; that's why my regmatch_t match[1] is only one item long.
reader is the char array containing the text I am checking for a blank line.
I have seen other examples and tried to base the code off of those examples, but I still seem to be missing something.
Thank you kindly for the help/advice.
If anything needs to be clarified please let me know.

It seems that you have to compile the regex as extended:
regcomp(&re, "\r?\n\r?\n", REG_EXTENDED);
The first atom, \r? is probably unnecessary, because it doesn't add to the blank-line condition if you don't capture the result.
In the above, blank line really means empty line. If you want blank line to mean a line that has no characters except for white space, you can use:
regcomp(&re, "\r?\n[ \t]*\r?\n", REG_EXTENDED);
(I don't think you can use the space character pattern \s here instead of [ \t], because that would include carriage return and newline.)
As others have already hinted at, the "simple use of * on the command line" is not a regular expression. This wildcard matching is called file globbing and has different semantics.

Check what the * in a regex means. It's not like the "anything" wildcard on the command line. The * means that the previous component can appear any number of times. The wildcard in regex is the dot (.). So if you want to say "match anything" you can do .*, which means anything, any number of times.
So in your case you can do .*\n\n.*, which would match anything that contains \n\n.
Finally, you can use alternation and ( ) to group things in a regex. So you can do something like .*(\n\n|\r\n\r\n).*, and that would match anything that contains a \n\n or a \r\n\r\n.
Hope that helps.

Rather than looking only for \r or \n, why not look for anything that is not \r or \n?
Your regex would simply be
'[^\r\n]'
and a match result of false indicates a blank line per your specification.

Related

Parsing the payload of the AT commands from the full response string

I want to parse the actual payload from the output of AT commands.
For instance: in the example below, I'd want to read only "2021/11/16,11:12:14-32,0"
AT+QLTS=1 // command
+QLTS: "2021/11/16,11:12:14-32,0" // response
OK
In the following case, I'd need to only read 12345678.
AT+CIMI // command
12345678 // example response
So the point is: not all commands have the same format for the output. We can assume the response is stored in a string array.
I have GetAtCmdRsp() already implemented which stores the response in a char array.
void GetPayload()
{
    char rsp[100] = {0};

    GetAtCmdRsp("AT+QLTS=1", rsp);
    // rsp now contains +QLTS: "2021/11/16,11:12:14-32,0"
    // now, I need to parse "2021/11/16,11:12:14-32,0" out of the response

    memset(rsp, 0, sizeof(rsp));

    GetAtCmdRsp("AT+CIMI", rsp);
    // rsp now contains 12345678
    // no need to do additional parsing since the output already contains the value I need
}
I was thinking of doing char *start = strstr(rsp, ":") + 1; to get the start of the payload, but some responses may only contain the payload, as is the case with AT+CIMI.
Perhaps regex could be a good idea to determine the pattern +<COMMAND>: in a string?
In order to parse AT command responses, a good starting point is understanding all the possible formats they can have. So, rather than implementing a command-specific routine, I would discriminate commands by "type of response":
Commands with no payload in their answers, for example
AT
OK
Commands with no header in their answers, such as
AT+CIMI
12345678
OK
Commands with a single header in their answers
AT+QLTS=1
+QLTS: "2021/11/16,11:12:14-32,0"
OK
Commands with multi-line responses. Every line could be of "single header" type, like in +CGDCONT:
AT+CGDCONT?
+CGDCONT: 1,"IP","epc.tmobile.com","0.0.0.0",0,0
+CGDCONT: 2,"IP","isp.cingular","0.0.0.0",0,0
+CGDCONT: 3,"IP","","0.0.0.0",0,0
OK
Or we could even have mixed types, like in +CMGL:
AT+CMGL="ALL"
+CMGL: 1,"REC READ","+XXXXXXXXXX","","21/11/25,10:20:00+00"
Good morning! How are you?
+CMGL: 2,"REC READ","+XXXXXXXXXX","","21/11/25,10:33:33+00"
I'll come a little late. See you. Bruce Wayne
OK
(please note how it can also contain "empty" lines, that is, \r\n).
At the moment I cannot think of any other scenario. In this way you'll be able to define an enum like
typedef enum
{
    AT_RESPONSE_TYPE_NO_RESPONSE,
    AT_RESPONSE_TYPE_NO_HEADER,
    AT_RESPONSE_TYPE_SINGLE_HEADER,
    AT_RESPONSE_TYPE_MULTILINE,
    AT_RESPONSE_TYPE_MAX
} at_response_type_t;
and pass it to your GetAtCmdRsp() function in order to parse the response accordingly. Whether you implement the differentiation in that function, after it, or in an external function is your choice.
A solution without explicit categorization
Once you have clear all the scenarios that might ever occur, you can think about a general algorithm working for all of them:
Get the full response resp after the command echo and before the closing OK or ERROR. Make sure that the trailing \r\n\r\nOK is removed (or \r\nERROR, or \r\nNO CARRIER, or whatever the terminating message of the response might be). Also make sure to remove the command echo.
If strlen( resp ) == 0, we belong to the NO_RESPONSE category, and the job is done.
If the response contains \r\ns in it, we have a MULTILINE answer. So, tokenize it and place every line into an array element resp_arr[i]. Make sure to remove the trailing \r\n.
For every line in the response (for every resp_arr[i] element), search for the +<CMD>: pattern (not only :, which might appear in the payload as well!). Something like this:
char *payload;

if (strstr(resp_cur_line, "+YOURCMD: ") == NULL)
{
    // We are in the "NO_HEADER" case
    payload = resp_cur_line;
}
else
{
    // We are in the "HEADER" case
    payload = resp_cur_line + strlen("+YOURCMD: ");
}
Now payload pointer points to the actual payload.
Please note how, in the case of a MULTILINE answer, after splitting the lines into array elements, every loop iteration will also correctly handle the mixed scenarios like the one in +CMGL, as you'll be able to distinguish the lines containing a header from those containing data (and from the empty lines, of course). For a deeper analysis of +CMGL response parsing, have a look at this answer.

Using strtok to tokenize html

I'm looking to extract text found exactly within an < and >, while also extracting things found between > and <.
For instance:
<html> would just return <html>
<title>This is a title</title> would return <title>, This is a title, </title>
This is a title would return This is a title
And finally <title>This is a weird use of < bracket</title> should return <title>, This is a weird use of < bracket, </title>. My current version recognises it as <title>, This is a weird use of, < bracket, </title>
I'd appreciate any snippets of code, or directions to head in to get to a solution.
tl;dr: grab substrings within <...> and >...< separately without being stumped by a floating ...>... or ...<....
Edit: not using strtok anymore; I would appreciate any other help or similar problems you may know about. Anything to read would also be greatly beneficial. Note: we aren't trying to parse, simply lex the input string.
I can only use the standard C libraries.
I'm just trying to build a basic validator for a subset of valid HTML.
You can't, not even a basic one. You will have too many false positives and negatives. Here's a simple example.
<tag attribute=">" />
HTML has many features which do not allow simple parsing. It is...
Balanced, like <tag></tag> and also "quotes".
Nested, like <tag><tag></tag></tag>.
Escaped, like "escaped\"quote".
Has other languages embedded in it, like Javascript and CSS.
If this is an exercise in tokenization, you could define a very specific subset, but I'd suggest something simpler like JSON which has a well defined grammar. Those are typically parsed using a lexer and parser, but JSON is small enough to be written by hand.
My own solution has been thus far,
as suggested by @chqrlie...
void tokenize(char *stringPtr)
{
    char flag[2];
    strcpy(flag, " ");
    /* We build this up as we iterate the string.
       Strtok was not suitable, build up tokens char by char */
    char tempToken[tokenLength];
    strcpy(tempToken, ""); // Init current token
    // Traverse string catching stuff between <...> and >...< separately.
    for (int i = 0; i < strlen(stringPtr); i++)
    {
        if (stringPtr[i] == '<')
        {
            if (strcmp(flag, " ") == 0)
            {
                putToken(tempToken);
                strcpy(tempToken, ""); // Tag starting, everything before it is a token.
                strcpy(flag, "<");
                strcat(tempToken, flag);
            }
            else // Catches <...<
            {
                presentError(stringPtr);
            }
        }
        else if (stringPtr[i] == '>')
        {
            if (strcmp(flag, "<") == 0)
            {
                strcat(tempToken, ">");
                strcpy(flag, " ");
                putToken(tempToken);
                strcpy(tempToken, "");
            }
            else // Can't have a > unless we saw < already
            {
                presentError(stringPtr);
            }
        }
        else // Manage non angle brackets
        {
            strncat(tempToken, &stringPtr[i], 1);
        }
    }
    putToken(tempToken); // Catches a line ending in a value, not a tag

    /* Notes
       Floating <'s and >'s will be errored up
       - Special case ...<...>..., which is incorrect,
         will cause floating tokens, can be identified
       Unclosed tags i.e. </p will be tokenized verbatim,
         thus can identify this mistake
       Unopened tags i.e. p> will be errored
    */
}
Assume that presentError() terminates lexing.
Some improvements can be made; I'm open to suggestions. However, this is a first working draft.

Markup matches and escape special characters outside the matches at the same time

I have a search functionality for a treeview that highlights all matches, incl. distinction between caseless and case-sensitive, as well as distinction between regular expression and literal. However, I have a problem when the current cell contains special characters that are not part of the matches. Consider the following text inside a treeview cell:
father & mother
Now I want to do, for example, a search on the whole treeview for the letter 'e'. For highlighting the matches only and not the whole cell, I need to use markup. To achieve this, I use g_regex_replace_eval and its callback function as described in the GLib documentation. The resulting new marked-up text for the cell would be like this:
fath<span background='yellow' foreground='black'>e</span>r &
moth<span background='yellow' foreground='black'>e</span>r
If there are special characters inside the matches, they are escaped before being added to the hashtable that is used by the eval function. So special characters inside matches are no problem.
But I have the '&' now outside the markup parts, and it has to be changed to &amp;, otherwise the markup won't show up in the cell and a warning
Failed to set text from markup due to error parsing markup: Error on line x: Entity did not end with a semicolon; most likely you used an ampersand character without intending to start an entity - escape ampersand as &amp;
will be shown inside the terminal.
If I use g_markup_escape_text on the new cell text, it will obviously not only escape the '&', but also the '<' and '>' of the markup, so this is no solution.
Is there a reasonable way to put markup around the matches and escape special characters outside the markup at the same time or with a view steps? Everything I could imagine so far is much too complicated, if it would work at all.
Even though I had already considered Philip's suggestion in most of its parts before asking my question, I had not touched yet the subject of utf8, so he gave an important hint for the solution. The following is the core of a working implementation:
gchar *counter_char = original_cell_txt; // counter_char will move through all the characters of original_cell_txt.
gint counter;
gunichar unichar;
gchar utf8_char[6]; // Six bytes is the buffer size needed later by g_unichar_to_utf8 ().
gint utf8_length;
gchar *utf8_escaped;
enum { START_POS, END_POS };
GArray *positions[2];
positions[START_POS] = g_array_new (FALSE, FALSE, sizeof (gint));
positions[END_POS] = g_array_new (FALSE, FALSE, sizeof (gint));
gint start_position, end_position;

txt_with_markup = g_string_new ("");

g_regex_match (regex, original_cell_txt, 0, &match_info);
while (g_match_info_matches (match_info)) {
    g_match_info_fetch_pos (match_info, 0, &start_position, &end_position);
    g_array_append_val (positions[START_POS], start_position);
    g_array_append_val (positions[END_POS], end_position);
    g_match_info_next (match_info, NULL);
}

do {
    unichar = g_utf8_get_char (counter_char);
    counter = counter_char - original_cell_txt; // pointer arithmetic

    if (counter == g_array_index (positions[END_POS], gint, 0)) {
        txt_with_markup = g_string_append (txt_with_markup, "</span>");
        // It's simpler to always access the first element instead of looping through the whole array.
        g_array_remove_index (positions[END_POS], 0);
    }
    /*
       No "else if" is used here, since if there is a search for a single character going on and
       such a character appears twice, as 'm' in "command", between both m's a span tag has to be
       closed and opened at the same position.
    */
    if (counter == g_array_index (positions[START_POS], gint, 0)) {
        txt_with_markup = g_string_append (txt_with_markup, "<span background='yellow' foreground='black'>");
        // See the comment for the similar instruction above.
        g_array_remove_index (positions[START_POS], 0);
    }

    utf8_length = g_unichar_to_utf8 (unichar, utf8_char);
    /*
       Instead of using a switch statement to check whether the current character needs to be escaped,
       for simplicity the character is sent to the escape function regardless of whether there will be
       any escaping done by it or not.
    */
    utf8_escaped = g_markup_escape_text (utf8_char, utf8_length);
    txt_with_markup = g_string_append (txt_with_markup, utf8_escaped);
    // Cleanup
    g_free (utf8_escaped);

    counter_char = g_utf8_find_next_char (counter_char, NULL);
} while (*counter_char != '\0');

/*
   There is a '</span>' to set at the end; because the end position is one position after the string size,
   this couldn't be done inside the preceding loop.
*/
if (positions[END_POS]->len) {
    g_string_append (txt_with_markup, "</span>");
}

g_object_set (txt_renderer, "markup", txt_with_markup->str, NULL);

// Cleanup
g_regex_unref (regex);
g_match_info_free (match_info);
g_array_free (positions[START_POS], TRUE);
g_array_free (positions[END_POS], TRUE);
Probably the way to do this is to not use g_regex_replace_eval(), but rather to use g_regex_match_all() to get the list of matches for a string. Then you need to step through the string character-by-character (do this using the g_utf8_*() functions, since this has to be Unicode-aware). If you get to a character which needs to be escaped (<, >, &, ", '), output the escaped entity for it. When you get to a match position, output the correct markup for it.
I'd escape the whole text first using g_markup_escape_text, then escape the text to search and use it in g_regex_replace_eval. This way escaped text can be matched, and text not matched is already escaped.

Recognising multiple newlines in PEGKit

I am learning how to use PEGKit, but am running into a problem creating a grammar for a script that parses lines, even when they are separated by multiple line-break characters. I have reduced the problem to this grammar:
expr
#before {
PKTokenizer *t = self.tokenizer;
self.silentlyConsumesWhitespace = NO;
t.whitespaceState.reportsWhitespaceTokens = YES;
self.assembly.preservesWhitespaceTokens = YES;
}
= Word nl*;
nl = nl_char nl_char*;
nl_char = '\n'! | '\r'!;
This simple grammar, to me, should allow one word per line, with as many line breaks as necessary. But it only allows one word with an optional line break. Does anybody know what's wrong here? Thank you.
Creator of PEGKit here.
Try the following grammar instead (make sure you are using HEAD of master):
#before {
PKTokenizer *t = self.tokenizer;
[t.whitespaceState setWhitespaceChars:NO from:'\\n' to:'\\n'];
[t.whitespaceState setWhitespaceChars:NO from:'\\r' to:'\\r'];
[t setTokenizerState:t.symbolState from:'\\n' to:'\\n'];
[t setTokenizerState:t.symbolState from:'\\r' to:'\\r'];
}
lines = line+;
line = ~eol* eol+; // note the `~` Not unary operator. this means "zero or more NON eol tokens, followed by one or more eol token"
eol = '\n'! | '\r'!;
Note that here, I am tweaking the tokenizer to recognize newlines and carriage returns as Symbols rather than whitespace. That makes them easier to match and discard (they are discarded by the ! operator).
For another approach to the same problem using the builtin S whitespace rule, see here.

Trying to make a match on a rule that uses a "recursive" identifier in flex

I have this line:
0, 6 -> W(1) L(#);
or
\# -> #shift_right R W(1) L
I have to parse this line with flex and put every element from each side of the arrow into a list. I know how to match simple things, but I don't know how to match multiple things with the same rule. I'm not allowed to increase the limit for rules. I have a hint: parse the pieces, the pieces will then combine, and I can use states, but I don't know how to do that, and I can't find examples on the net. Can someone help me?
So, here an example:
{
a -> W(b) #invert_loop;
b -> W(a) #invert_loop;
-> L(#)
}
When this section begins I have to create a structure for each line, where I put what is on the left of -> (some parameters) into a vector, and the right side into a list, where each term is kind of another structure. For what is on the right side I wrote rules:
writex W([a-zA-Z0-9.#]) for W(anything).
So I need to parse these lines so that I can put the parameters and the structures into the big structure. Something like this (for the first line):
new bigStruc with param = a and list of struct = W(anything), #invert(it is a notation for a reference to another structure)
So what I need is to know how to parse these lines so that I can create and fill these bigStructs, also using the rules for the simple structures (I have all I need for those structures, but I don't know how to parse so that I can use these methods).
Sorry for my English and I hope this time I was more clear on what I want.
Last-minute edit: I have matched the whole line with a rule, and then worked on it with strtok. Is there a way to use the previous rules to see what type of structure I have to create? I mean, not to sit and write lots of ifs, but to use writex W([a-zA-Z0-9.#]) to know that I have to create that kind of structure?
OK, let's see how this snippet works for you:
// these are exclusive rules, so they do not overlap, for inclusive rules, use %s
%x dataStructure
%x addRules
%%
<dataStructure>-> { BEGIN addRules; }
\{ { BEGIN dataStructure; }
<addRules>; { BEGIN dataStructure; }
<dataStructure>\} { BEGIN INITIAL; }
<dataStructure>[^,]+ { ECHO; } //this will output each comma separated token
<dataStructure>. { } //ignore anything else
<dataStructure>\n { } //ignore anything else
<addRules>[^ ]+ { ECHO; } //this will output each space separated rule
<addRules>. { } //ignore anything else
<addRules>\n { } //ignore anything else
%%
I'm not entirely sure what it is you want. Edit your original post to include the contents of your comments, with examples, and please structure your English better. If you can't explain what you want without contradicting yourself, I can't help you.